Unverified commit 7c9fbcf8, authored by PabloAgustin and committed by GitHub

New healthcare benchmark: careqa (#2714)



* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>
parent 2c8ffb80
...@@ -21,6 +21,13 @@ def bypass_agg(arr):
    return 999


@register_aggregation("nanmean")
def nanmean(arr):
    if len(arr) == 0 or all(np.isnan(arr)):
        return np.nan
    return np.nanmean(arr)


@register_aggregation("mean")
def mean(arr):
    return sum(arr) / len(arr)
...@@ -498,6 +505,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
        bleu,
        chrf,
        ter,
        nanmean,
    ]
    if metric in bootstrappable:
...
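For context (not part of the diff), a minimal sketch of why a `nanmean` aggregation matters here: when a per-document metric fails and the task's `process_results` records `np.nan` for it, a plain `mean` would turn the whole task score into NaN, whereas `nanmean` simply skips the failed documents.

```python
import numpy as np

# Hypothetical per-document scores: two documents failed to score and were
# recorded as NaN by the task's process_results function.
scores = [0.21, np.nan, 0.35, np.nan, 0.18]

print(np.mean(scores))     # nan  -- a plain mean is contaminated by the failures
print(np.nanmean(scores))  # 0.2466...  -- nanmean averages only the valid scores
```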
# CareQA
### Paper
Title: `Automatic Evaluation of Healthcare LLMs Beyond Question-Answering`
Abstract: [https://arxiv.org/abs/2502.06666](https://arxiv.org/abs/2502.06666)
CareQA originates from the Spanish Specialised Healthcare Training (MIR) exams administered
by the Spanish Ministry of Health. The close-ended version is a multiple-choice question
answering (MCQA) dataset comprising 5,621 QA pairs across six categories: medicine, nursing,
biology, chemistry, psychology, and pharmacology, sourced from the 2020 to 2024 exam
editions. CareQA is available in both English and Spanish. The open-ended version
(English only) contains 3,730 QA pairs.
Homepage: \
[https://huggingface.co/datasets/HPAI-BSC/CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA)
#### Tasks
* `careqa_en`: MCQA in English.
* `careqa_es`: MCQA in Spanish.
* `careqa_open`: Open-ended QA in English.
* `careqa_open_perplexity`: Open-ended QA in English, evaluated with perplexity.
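For orientation, a hedged sketch of how one of these tasks might be run through the harness's Python API (`lm_eval.simple_evaluate`); the checkpoint below is only an example and any Hugging Face model id can be substituted:

```python
# Sketch only: assumes the lm-evaluation-harness Python API and hardware able
# to load the chosen Hugging Face checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
    tasks=["careqa_en"],
    num_fewshot=0,
)
print(results["results"]["careqa_en"])  # accuracy on the English MCQA split
```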
### Citation
```bibtex
@misc{ariasduart2025automaticevaluationhealthcarellms,
title={Automatic Evaluation of Healthcare LLMs Beyond Question-Answering},
author={Anna Arias-Duart and Pablo Agustin Martin-Torres and Daniel Hinjos and Pablo Bernabeu-Perez and Lucia Urcelay Ganzabal and Marta Gonzalez Mallo and Ashwin Kumar Gururajan and Enrique Lopez-Cuena and Sergio Alvarez-Napagao and Dario Garcia-Gasulla},
year={2025},
eprint={2502.06666},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.06666},
}
```
task: careqa_en
dataset_path: HPAI-BSC/CareQA
dataset_name: CareQA_en
test_split: test
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: ['A', 'B', 'C', 'D']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: careqa_en.yaml
task: careqa_es
dataset_name: CareQA_es
task: careqa_open
dataset_path: HPAI-BSC/CareQA
dataset_name: CareQA_en_open
description: >
Instructions: The following text is a medical question. Answer it in the most factual, concise and informative way possible.
output_type: generate_until
test_split: test
doc_to_text: !function utils_open.doc_to_text
doc_to_target: !function utils_open.doc_to_target
process_results: !function utils_open.process_results_gen
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: bleu
aggregation: nanmean
higher_is_better: true
- metric: rouge1
aggregation: nanmean
higher_is_better: true
- metric: rouge2
aggregation: nanmean
higher_is_better: true
- metric: rougeL
aggregation: nanmean
higher_is_better: true
- metric: bleurt
aggregation: nanmean
higher_is_better: true
- metric: bert_score
aggregation: nanmean
higher_is_better: true
metadata:
version: 1.0
include: careqa_open.yaml
task: careqa_open_perplexity
output_type: loglikelihood_rolling
doc_to_text: ""
doc_to_target: !function utils_open.doc_to_target
process_results: !function utils_perplexity.process_results
metric_list:
- metric: word_perplexity
higher_is_better: false
- metric: byte_perplexity
higher_is_better: false
- metric: bits_per_byte
higher_is_better: false
metadata:
version: 1.0
generation_kwargs: null
def doc_to_text(doc) -> str:
"""
Question: <question>
Choices:
A. <choice1>
B. <choice2>
C. <choice3>
D. <choice4>
Answer:
"""
if doc["question"] is None:
doc = {
"question": "In relation to the immune mechanism involved in the rejection of transplanted solid organs, indicate the incorrect answer:",
"op1": "Acute T-cell mediated rejection can be controlled through the use of drugs such as cyclosporine A or corticosteroids.",
"exam_id": 36,
"op3": "Chronic rejection or chronic graft injury is associated with endothelial damage mediated by anti-HLA antibodies.",
"category": "Medicine",
"unique_id": "5636d1af-e0b1-43b0-8a04-6f127dcf6785",
"op4": "Hyperacute rejection is mediated by cytotoxic T lymphocytes against donor antigens present in the recipient.",
"op2": "The presence of specific antibodies against the donor (DSA) in the recipient prior to transplantation is a contraindication for it.",
"cop": 4,
"year": 2024,
}
choices = [doc["op1"], doc["op2"], doc["op3"], doc["op4"]]
option_choices = {
"A": choices[0],
"B": choices[1],
"C": choices[2],
"D": choices[3],
}
prompt = "Question: " + doc["question"] + "\nChoices:\n"
for choice, option in option_choices.items():
prompt += f"{choice.upper()}. {option}\n"
prompt += "Answer:"
return prompt
def doc_to_target(doc) -> int:
return doc["cop"] - 1
import numpy as np
try:
import evaluate
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
bleurt = evaluate.load("bleurt", "bleurt-base-512", module_type="metric")
except (ModuleNotFoundError, ImportError):
raise ModuleNotFoundError(
"Please install evaluation metrics via pip install evaluate and pip install bert-score",
)
except Exception as e:
raise RuntimeError(
f"Error loading evaluation metrics: {str(e)}. Please check your installation."
)
def doc_eval(pred, refs):
try:
bleu_results = bleu.compute(predictions=pred, references=refs)
except Exception as e:
print(f"Bleu error: {e}")
bleu_results = {"bleu": np.NAN}
try:
rouge_results = rouge.compute(predictions=pred, references=refs)
except Exception as e:
print(f"Rouge error: {e}")
rouge_results = {"rouge1": np.NAN, "rouge2": np.NAN, "rougeL": np.NAN}
try:
bleurt_scores = bleurt.compute(predictions=pred, references=refs)["scores"]
except Exception as e:
print(f"Bleurt error: {e}")
bleurt_scores = [np.nan]
try:
bert_scores = bertscore.compute(predictions=pred, references=refs, lang="en")[
"f1"
]
except Exception as e:
print(f"Bert error: {e}")
bert_scores = [np.nan]
if bleu_results["bleu"] == 0:
# Sometimes bleu is 0.0 and this breaks the stderr computation.
bleu_results["bleu"] += 1e-5
results = {
"bleu": bleu_results["bleu"],
"rouge1": rouge_results["rouge1"],
"rouge2": rouge_results["rouge2"],
"rougeL": rouge_results["rougeL"],
"bleurt": np.mean(bleurt_scores),
"bert_score": np.mean(bert_scores),
}
return results
def doc_to_text(doc) -> str:
return doc["question"]
def doc_to_target(doc) -> str:
return doc["answer"]
def process_results_gen(doc, results):
pred, refs = [results[0]], [doc_to_target(doc)]
if len(refs[0]) < 1 or len(pred[0]) < 1:
return {
"bleu": np.NAN,
"rouge1": np.NAN,
"rouge2": np.NAN,
"rougeL": np.NAN,
"bleurt": np.NAN,
"bert_score": np.NAN,
}
results = doc_eval(pred, refs)
return {
"bleu": results["bleu"],
"rouge1": results["rouge1"],
"rouge2": results["rouge2"],
"rougeL": results["rougeL"],
"bleurt": results["bleurt"],
"bert_score": results["bert_score"],
}
def process_results_gen_w_repeats(doc, results):
pred, refs = [results[0]], [doc_to_target(doc)]
if len(refs[0]) < 1 or len(pred[0]) < 1:
return {
"bleu": np.NAN,
"rouge1": np.NAN,
"rouge2": np.NAN,
"rougeL": np.NAN,
"bleurt": np.NAN,
"bert_score": np.NAN,
}
results = doc_eval(pred, refs)
return {
"bleu": results["bleu"],
"rouge1": results["rouge1"],
"rouge2": results["rouge2"],
"rougeL": results["rougeL"],
"bleurt": results["bleurt"],
"bert_score": results["bert_score"],
}
import math
import re
def doc_to_target(doc) -> str:
return doc["answer"]
def process_results(doc, results):
(loglikelihood,) = results
_words = len(re.split(r"\s+", doc_to_target(doc)))
_bytes = len(doc_to_target(doc).encode("utf-8"))
print(f"perplexity: {math.exp(-loglikelihood / _words)}")
return {
"word_perplexity": (loglikelihood, _words),
"byte_perplexity": (loglikelihood, _bytes),
"bits_per_byte": (loglikelihood, _bytes),
}
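For context, a rough sketch of what the harness does with these (loglikelihood, count) pairs when aggregating `word_perplexity`: per-document log-likelihoods and word counts are summed over the split and combined as exp(-total_loglikelihood / total_words). The numbers below are made up and this is not the harness's actual implementation:

```python
import math

# Hypothetical (loglikelihood, word_count) pairs returned by process_results
# for three documents.
per_doc = [(-42.7, 31), (-55.1, 40), (-38.9, 27)]

total_ll = sum(ll for ll, _ in per_doc)
total_words = sum(n for _, n in per_doc)

# Corpus-level word perplexity, assuming a weighted aggregation of this form.
print(math.exp(-total_ll / total_words))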
group: med_prescriptions
task: med_prescriptions_easy
dataset_path: devlocalhost/prescription-full
output_type: multiple_choice
training_split: train
validation_split: train
test_split: train
process_docs: !function utils.process_docs
doc_to_text: !function utils.doc_to_text_easy
doc_to_choice: !function utils.doc_to_choice_easy
doc_to_target: !function utils.doc_to_target
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: med_prescriptions_easy.yaml
task: med_prescriptions_hard
doc_to_text: !function utils.doc_to_text_hard
doc_to_choice: !function utils.doc_to_choice_hard
group: med_text_classification
task: med_text_classification_easy
dataset_path: csv
dataset_name: null
dataset_kwargs:
data_files:
train: /gpfs/projects/bsc70/heka/data/datasets/med_text_class_train.csv
output_type: multiple_choice
training_split: train
validation_split: train
test_split: train
process_docs: !function utils.process_docs
doc_to_text: !function utils.doc_to_text_easy
doc_to_choice: !function utils.doc_to_choice_easy
doc_to_target: !function utils.doc_to_target_easy
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: med_text_classification_easy.yaml
task: med_text_classification_hard
dataset_kwargs:
data_files:
train: /gpfs/projects/bsc70/heka/data/datasets/mtsamples.csv
process_docs: !function utils.process_docs_hard
doc_to_text: !function utils.doc_to_text_hard
doc_to_choice: !function utils.doc_to_choice_hard
doc_to_target: !function utils.doc_to_target_hard
import random
import datasets
def process_docs_hard(dataset: datasets.Dataset):
return dataset
def process_docs(dataset: datasets.Dataset):
def _helper(doc):
return doc
num_entries = len(dataset)
ten_percent_index = int(0.1 * num_entries)
# Select the first 10% of the dataset
filtered_dataset = dataset.select(range(ten_percent_index))
return filtered_dataset.map(_helper)
def doc_to_choice_easy(doc):
return [
"neoplasms",
"digestive system diseases",
"nervous system diseases",
"cardiovascular diseases",
"general pathological conditions",
]
def doc_to_text_easy(doc) -> str:
choices = doc_to_choice_easy(doc)
prompt = (
"Classify the topic of the following medical text into one of the following choices. \n"
"Text: {} \n"
"Choices: \n"
"A. {} \n"
"B. {} \n"
"C. {} \n"
"D. {} \n"
"E. {} \n Answer:".format(
doc["text"], choices[0], choices[1], choices[2], choices[3], choices[4]
)
)
return prompt
def doc_to_target_easy(doc):
return int(doc["class"]) - 1
def doc_to_text_hard(doc) -> str:
choices = doc_to_choice_hard(doc)
prompt = (
"Select the medical specialty the following text is talking about among the following choices. \n"
"Text: {} \n"
"Choices: {}\n"
" Answer:".format(doc["transcription"], choices)
)
return prompt
def doc_to_choice_hard(doc):
choices_list = [
" Bariatrics",
" Allergy / Immunology",
" Dentistry",
" Cardiovascular / Pulmonary",
" Urology",
" Hospice - Palliative Care",
" Radiology",
" Pediatrics - Neonatal",
" Neurology",
" Neurosurgery",
" Emergency Room Reports",
" IME-QME-Work Comp etc.",
" Office Notes",
" Surgery",
" Letters",
" Ophthalmology",
" Hematology - Oncology",
" Endocrinology",
" Cosmetic / Plastic Surgery",
" Diets and Nutritions",
" Rheumatology",
" Nephrology",
" Physical Medicine - Rehab",
" Podiatry",
" Chiropractic",
" Lab Medicine - Pathology",
" Orthopedic",
" Autopsy",
" Psychiatry / Psychology",
" Speech - Language",
" ENT - Otolaryngology",
" Sleep Medicine",
" Dermatology",
" SOAP / Chart / Progress Notes",
" General Medicine",
" Consult - History and Phy.",
" Obstetrics / Gynecology",
" Gastroenterology",
" Pain Management",
" Discharge Summary",
]
return choices_list
def doc_to_target_hard(doc):
choices = doc_to_choice_hard(doc)
gold = doc["medical_specialty"]
idx = choices.index(gold)
return idx
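Worth noting (an observation, not documented in the PR): every entry in `doc_to_choice_hard` keeps a leading space, presumably because the raw `medical_specialty` values in mtsamples.csv carry one; otherwise `choices.index(gold)` would raise a ValueError. A tiny illustration with a hypothetical row:

```python
# Hypothetical row: the leading space in the label is intentional.
sample = {"medical_specialty": " Cardiovascular / Pulmonary"}
print(doc_to_target_hard(sample))  # 3, the position of that label in doc_to_choice_hard
```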
# Meddialog
### Paper
Title: `MedDialog: Large-scale Medical Dialogue Datasets`
Abstract: [https://aclanthology.org/2020.emnlp-main.743/](https://aclanthology.org/2020.emnlp-main.743/)
This task group contains the English version of the MedDialog medical dialogue dataset, divided into two tasks:
question entailment, and open-ended question answering (QA).
#### Tasks
* `meddialog_qsumm`: Question entailment in English.
* `meddialog_qsumm_perplexity`: Question entailment in English, evaluated with perplexity.
* `meddialog_raw_dialogues`: Open-ended QA in English.
* `meddialog_raw_perplexity`: Open-ended QA in English, evaluated with perplexity.
### Citation
```bibtex
@inproceedings{zeng-etal-2020-meddialog,
title = "{M}ed{D}ialog: Large-scale Medical Dialogue Datasets",
author = "Zeng, Guangtao and
Yang, Wenmian and
Ju, Zeqian and
Yang, Yue and
Wang, Sicheng and
Zhang, Ruisi and
Zhou, Meng and
Zeng, Jiaqi and
Dong, Xiangyu and
Zhang, Ruoyu and
Fang, Hongchao and
Zhu, Penghui and
Chen, Shu and
Xie, Pengtao",
editor = "Webber, Bonnie and
Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.743/",
doi = "10.18653/v1/2020.emnlp-main.743",
pages = "9241--9250",
abstract = "Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build large-scale medical dialogue datasets {--} MedDialog, which contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, 44.53 million tokens, covering 96 specialties of diseases. To our best knowledge, MedDialog is the largest medical dialogue dataset to date. We pretrain several dialogue generation models on the Chinese MedDialog dataset, including Transformer, GPT, BERT-GPT, and compare their performance. It is shown that models trained on MedDialog are able to generate clinically correct and doctor-like medical dialogues. We also study the transferability of models trained on MedDialog to low-resource medical dialogue generation tasks. It is shown that via transfer learning which finetunes the models pretrained on MedDialog, the performance on medical dialogue generation tasks with small datasets can be greatly improved, as shown in human evaluation and automatic evaluation. The datasets and code are available at \url{https://github.com/UCSD-AI4H/Medical-Dialogue-System}"
}
```
group: meddialog
include: meddialog_raw_dialogues.yaml
task: meddialog_qsumm
dataset_path: lighteval/med_dialog
dataset_name: icliniq
description: >
Instructions: The following text contains a medical question. Extract and summarize the question.
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.doc_to_text_qsumm
doc_to_target: !function utils.doc_to_target_qsumm
process_results: !function utils.process_results_gen_qsumm
include: meddialog_qsumm.yaml
task: meddialog_qsumm_perplexity
output_type: loglikelihood_rolling
doc_to_text: ""
process_results: !function utils_perplexity.process_results_qsumm
metric_list:
- metric: word_perplexity
higher_is_better: false
- metric: byte_perplexity
higher_is_better: false
- metric: bits_per_byte
higher_is_better: false
metadata:
version: 1.0
group: meddialog
task: meddialog_raw_dialogues
dataset_path: bigbio/meddialog
description: >
Instructions: The following text is from a collection of medical dialogues. What follows is the patient's question. Answer how a doctor would, trying to be as helpful as possible.
output_type: generate_until
training_split: train
validation_split: train
test_split: train
doc_to_text: !function utils.doc_to_text_raw
doc_to_target: !function utils.doc_to_target_raw
process_results: !function utils.process_results_gen_raw
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: bleu
aggregation: nanmean
higher_is_better: true
- metric: rouge1
aggregation: nanmean
higher_is_better: true
- metric: rouge2
aggregation: nanmean
higher_is_better: true
- metric: rougeL
aggregation: nanmean
higher_is_better: true
- metric: bleurt
aggregation: nanmean
higher_is_better: true
- metric: bert_score
aggregation: nanmean
higher_is_better: true
metadata:
version: 1.0