Commit 885f48d6 authored by Yoav Katz, committed by GitHub

Initial integration of the Unitxt to LM eval harness (#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed and print a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent d4a913c4
group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_f1_micro
    aggregation: unitxt
    higher_is_better: true
  - metric: unitxt_accuracy
    aggregation: unitxt
    higher_is_better: true
  - metric: unitxt_f1_macro
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_f1_micro_multi_label
    aggregation: unitxt
    higher_is_better: true
  - metric: unitxt_accuracy
    aggregation: unitxt
    higher_is_better: true
  - metric: unitxt_f1_macro_multi_label
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_char_edit_dist_accuracy
    aggregation: unitxt
    higher_is_better: true
  - metric: unitxt_rouge
    aggregation: unitxt
    higher_is_better: true
  - metric: unitxt_char_edit_distance[reference_field=original_text]
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_squad
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_spearman
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_ner
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

group:
  - unitxt
dataset_path: unitxt/data
output_type: generate_until
training_split: train
validation_split: test
doc_to_text: '{{source}}'
doc_to_target: target
process_results: !function 'unitxt_wrapper.process_results'
generation_kwargs:
  until:
    - </s>
metric_list:
  - metric: unitxt_rouge
    aggregation: unitxt
    higher_is_better: true
metadata:
  version: 1.0

try:
    from unitxt import evaluate
except ImportError:
    raise ImportError(
        "Package 'unitxt' is not installed. To install it, use `pip install 'lm_eval[unitxt]'`"
    )

from lm_eval.api.registry import AGGREGATION_REGISTRY, METRIC_REGISTRY, register_metric


def unitxt_agg_metric(items):
    # Each item is the (results, doc, metric) tuple built in process_results.
    preds = [pred[0] for pred, _, _ in items]
    refs = [ref for _, ref, _ in items]
    # Map the harness metric name back to its Unitxt name.
    metric_name = items[0][2].replace("unitxt_", "metrics.")
    for ref in refs:
        ref["metrics"] = [metric_name]

    result_metrics = evaluate(preds, refs)
    return result_metrics[0]["score"]["global"]["score"]


AGGREGATION_REGISTRY["unitxt"] = unitxt_agg_metric


def unitxt_metric(items):  # This is a passthrough function
    return items


def process_results(doc, results):
    metrics = doc["metrics"]
    scores = {}
    for metric in metrics:
        # Map the Unitxt metric name to the harness naming scheme.
        metric = metric.replace("metrics.", "unitxt_")
        scores[metric] = (results, doc, metric)

        # Register the metric on first use so the harness can aggregate it.
        if metric not in METRIC_REGISTRY:
            register_metric(
                metric=metric,
                higher_is_better=True,
                output_type="generate_until",
                aggregation="unitxt",
            )(unitxt_metric)
    return scores
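
The data flow in the wrapper above can be traced without installing unitxt or lm_eval. The following is a minimal sketch, not the wrapper itself: `to_harness_name`, `to_unitxt_name`, `process_results_sketch`, `aggregate_sketch` and `fake_evaluate` are hypothetical names introduced here, and `fake_evaluate` is a toy exact-match stand-in that only mimics the result shape the wrapper reads (`[0]["score"]["global"]["score"]`). Unlike the wrapper, the sketch copies each doc instead of mutating it.

```python
# Sketch of the per-document -> aggregation flow; all names here are hypothetical.

def to_harness_name(unitxt_name: str) -> str:
    # "metrics.accuracy" -> "unitxt_accuracy"
    return unitxt_name.replace("metrics.", "unitxt_")


def to_unitxt_name(harness_name: str) -> str:
    # "unitxt_accuracy" -> "metrics.accuracy"
    return harness_name.replace("unitxt_", "metrics.")


def process_results_sketch(doc, results):
    # One (results, doc, metric) tuple per metric listed on the document.
    scores = {}
    for name in doc["metrics"]:
        metric = to_harness_name(name)
        scores[metric] = (results, doc, metric)
    return scores


def aggregate_sketch(items, evaluate_fn):
    preds = [pred[0] for pred, _, _ in items]
    refs = [dict(ref) for _, ref, _ in items]  # copy; the wrapper mutates in place
    metric_name = to_unitxt_name(items[0][2])
    for ref in refs:
        ref["metrics"] = [metric_name]
    return evaluate_fn(preds, refs)[0]["score"]["global"]["score"]


def fake_evaluate(preds, refs):
    # Toy stand-in for unitxt's evaluate: global exact-match accuracy.
    correct = sum(p == r["target"] for p, r in zip(preds, refs))
    score = correct / len(preds)
    return [{"score": {"global": {"score": score}}} for _ in preds]


docs = [
    {"source": "2+2=", "target": "4", "metrics": ["metrics.accuracy"]},
    {"source": "3+3=", "target": "6", "metrics": ["metrics.accuracy"]},
]
model_outputs = [["4"], ["7"]]  # one generation per document

per_doc = [process_results_sketch(d, r) for d, r in zip(docs, model_outputs)]
items = [scores["unitxt_accuracy"] for scores in per_doc]
score = aggregate_sketch(items, fake_evaluate)
print(score)  # 0.5
```

The per-document step only packages data; the actual scoring happens once, at aggregation time, over all documents. That appears to be why `unitxt_metric` is a passthrough.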
#
include: unitxt_tasks.summarization.abstractive
task: xsum
dataset_name: card=cards.xsum,template=templates.summarization.abstractive.full
include: unitxt_tasks.classification.multi_class
task: yahoo_answers_topics
dataset_name: card=cards.yahoo_answers_topics,template=templates.classification.multi_class.title
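
The two generated task files above share a three-line shape. As a rough sketch of how such a file could be rendered from a recipe string: the helpers `parse_recipe`, `include_for_template` and `render_task_yaml` are hypothetical and are not taken from generate_yamls.py. Deriving the include name from the template path is an assumption that happens to match the two examples shown here; the real script may determine the task differently.

```python
# Hypothetical helpers; not the actual generate_yamls.py implementation.

def parse_recipe(dataset_name: str) -> dict:
    # "card=cards.xsum,template=..." -> {"card": "cards.xsum", "template": "..."}
    return dict(part.split("=", 1) for part in dataset_name.split(","))


def include_for_template(template: str) -> str:
    # Assumption: drop the leading "templates" and the trailing template variant.
    parts = template.split(".")
    return ".".join(["unitxt_tasks"] + parts[1:-1])


def render_task_yaml(task: str, dataset_name: str) -> str:
    recipe = parse_recipe(dataset_name)
    include = include_for_template(recipe["template"])
    return f"include: {include}\ntask: {task}\ndataset_name: {dataset_name}\n"


text = render_task_yaml(
    "xsum",
    "card=cards.xsum,template=templates.summarization.abstractive.full",
)
print(text)
```

Keeping the per-dataset file down to `include`, `task` and `dataset_name` means all metric and generation settings live in one shared include per task type.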
@@ -76,6 +76,7 @@ testing = ["pytest", "pytest-cov", "pytest-xdist"]
vllm = ["vllm==0.3.2"]
zeno = ["pandas", "zeno-client"]
wandb = ["wandb>=0.16.3", "pandas", "numpy"]
unitxt = ["unitxt"]
all = [
    "lm_eval[anthropic]",
    "lm_eval[dev]",
@@ -94,6 +95,7 @@ all = [
    "lm_eval[vllm]",
    "lm_eval[zeno]",
    "lm_eval[wandb]",
    "lm_eval[unitxt]"
]
[tool.ruff.lint]
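
The `unitxt` extra pairs with the import check in unitxt_wrapper.py. A generic version of that check could look like the sketch below. `require_extra` is a hypothetical helper, not part of lm_eval. It uses the standard library's `importlib.util.find_spec` to test for a top-level module without importing it.

```python
import importlib.util


def require_extra(module_name: str, extra: str) -> None:
    # Hypothetical helper: raise a clear error if an optional dependency is missing.
    # Intended for top-level module names only.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Package '{module_name}' is not installed. "
            f"To install it, use `pip install 'lm_eval[{extra}]'`"
        )


# Usage: a module that is present passes silently, a missing one raises.
require_extra("json", "unitxt")
try:
    require_extra("surely_not_installed_pkg_xyz", "unitxt")
    message = None
except ImportError as err:
    message = str(err)
print(message)
```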