Unverified Commit 885f48d6 authored by Yoav Katz, committed by GitHub

Initial integration of Unitxt into the LM eval harness (#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt



The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.
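The wrapper itself is not shown in this diff. Purely as an illustrative sketch of the shape such glue code can take (hypothetical names and placeholder scoring, not the actual file contents), the generated task YAMLs expect a `process_results(doc, results)` hook plus a `unitxt` aggregation over the values it returns:

```
# Hypothetical illustration only -- not the contents of unitxt_wrapper.py.
def process_results(doc, results):
    # 'doc' is one row of the unitxt/data dataset; 'results' holds the model generation.
    prediction = results[0]
    # Defer real scoring to aggregation time by returning the raw pair under the
    # metric keys (unitxt_<metric>) declared in the generated metric_list.
    return {"unitxt_placeholder_metric": (prediction, doc["target"])}


def agg_unitxt(items):
    # 'items' collects the per-document values; a real implementation would hand
    # them to the corresponding Unitxt metric. Exact match is only a stand-in here.
    return sum(pred == ref for pred, ref in items) / len(items)
```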

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if it is not (see the sketch below)
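A minimal sketch of the kind of check described in item 2, assuming only that the Unitxt task code needs the `unitxt` package at import time (the exact wording and placement in the codebase may differ):

```
try:
    import unitxt  # noqa: F401
except ImportError as err:
    raise ImportError(
        "Unitxt tasks require the 'unitxt' package. "
        "Install it with: pip install 'lm_eval[unitxt]'"
    ) from err
```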

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent d4a913c4
@@ -443,6 +443,7 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
| sentencepiece | For using the sentencepiece tokenizer |
| sparseml | For using NM's SparseML models |
| testing | For running library test suite |
| unitxt | For IBM's unitxt dataset tasks |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
include: unitxt_tasks.classification.multi_class
task: 20_newsgroups
dataset_name: card=cards.20_newsgroups,template=templates.classification.multi_class.title
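The `dataset_name` string above is passed to the `unitxt/data` dataset loader on the HuggingFace Hub. Following the commented-out call in `generate_yamls.py`, you can load a capped sample of the same data directly; this sketch assumes the `datasets` package and network access, and the `source`/`target` field names follow the sample printing in `generate_yamls.py`:

```
from datasets import load_dataset

# Load a capped sample of the card/template combination referenced by the YAML above.
dataset = load_dataset(
    "unitxt/data",
    "card=cards.20_newsgroups,template=templates.classification.multi_class.title,loader_limit=100",
    trust_remote_code=True,
)
print(dataset["test"][0]["source"])  # rendered prompt
print(dataset["test"][0]["target"])  # gold answer
```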
# Unitxt
### Paper
Title: `Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI`
Abstract: `https://arxiv.org/abs/2401.14019`
Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.
The full Unitxt catalog can be viewed in an online explorer. `https://unitxt.readthedocs.io/en/latest/docs/demo.html`
Homepage: https://unitxt.readthedocs.io/en/latest/index.html
### Citation
```
@misc{unitxt,
title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
year={2024},
eprint={2401.14019},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `unitxt`: A subset of Unitxt tasks that were not previously in the LM-Eval Harness task catalog, including new task types such as multi-label classification, grammatical error correction, and named entity extraction.
#### Tasks
The full list of Unitxt tasks currently supported can be seen under the `tasks/unitxt` directory.
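As a quick end-to-end check, one of these tasks can be run through the usual harness entry points once the `unitxt` extra is installed. A sketch using the Python API, where the model choice (`pretrained=gpt2`) and the `limit` value are placeholder assumptions:

```
import lm_eval

# Evaluate one generated Unitxt task with a small HF model (placeholder choices).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["20_newsgroups"],
    limit=10,  # cap the number of documents for a quick smoke test
)
print(results["results"])
```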
### Adding tasks
You can add additional tasks from the Unitxt catalog by generating new LM-Eval yaml files for these datasets.
The Unitxt task yaml files are generated via the `generate_yamls.py` script in the `tasks/unitxt` directory.
To add a yaml file for an existing Unitxt dataset that is not yet in LM-Eval:
1. Add the card name to the `unitxt_datasets` file in the `tasks/unitxt` directory.
2. The `generate_yamls.py` script contains the default Unitxt [template](https://unitxt.readthedocs.io/en/latest/docs/adding_template.html) used for each kind of NLP task in its `default_template_per_task` dictionary. If the dataset is of a Unitxt task type not previously used in LM-Eval, you will need to add a default template for it to the dictionary (see the snippet after this list for checking a card's task type).
```
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}
```
3. Run `python generate_yamls.py` (this will generate yaml files for all the datasets listed in the `unitxt_datasets` file).
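If you are unsure which Unitxt task type a card maps to (and hence whether a default template already exists for it), you can query the card directly, mirroring the lookup `generate_yamls.py` performs internally; this assumes the `unitxt` package is installed and uses the `20_newsgroups` card from the examples above:

```
from unitxt.artifact import fetch_artifact

# Resolve the card and inspect its task type, mirroring the lookup in generate_yamls.py.
card_definition, _ = fetch_artifact("cards.20_newsgroups")
print(card_definition.task.__id__)  # e.g. "tasks.classification.multi_class"
```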
If you want to add a new dataset to the Unitxt catalog, see the Unitxt documentation:
https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: argument_topic
dataset_name: card=cards.argument_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.span_labeling.extraction
task: atis
dataset_name: card=cards.atis,template=templates.span_labeling.extraction.title
include: unitxt_tasks.classification.multi_class
task: banking77
dataset_name: card=cards.banking77,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: claim_stance_topic
dataset_name: card=cards.claim_stance_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.summarization.abstractive
task: cnn_dailymail
dataset_name: card=cards.cnn_dailymail,template=templates.summarization.abstractive.full
include: unitxt_tasks.grammatical_error_correction
task: coedit_gec
dataset_name: card=cards.coedit_gec,template=templates.grammatical_error_correction.simple
include: unitxt_tasks.classification.multi_class
task: dbpedia_14
dataset_name: card=cards.dbpedia_14,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ethos_binary
dataset_name: card=cards.ethos_binary,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: financial_tweets
dataset_name: card=cards.financial_tweets,template=templates.classification.multi_class.title
#
# This file generates a set of LM Eval Harness yaml files
# that load unitxt datasets (https://github.com/IBM/unitxt)
#
import unitxt_wrapper
import yaml
from unitxt.artifact import fetch_artifact
from unitxt.standard import StandardRecipe
# This code is required to properly dump LM harness YAML that contains references to functions
def function_representer(dumper: yaml.SafeDumper, func) -> yaml.nodes.ScalarNode:
    return dumper.represent_scalar(
        "!function", f"{func.__module__}.{func.__name__}", style=None
    )


def write_task_yaml(filename, data):
    yaml.add_representer(type(data["process_results"]), function_representer)
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)


def write_card_yaml(filename, data):
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)

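# Default Unitxt template to use for each supported Unitxt task type; cards whose
# task type is missing from this mapping cause generate_card_yaml() to raise.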
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}

def generate_task_yaml(task: str):
    """
    Generate an LM Eval Harness YAML file based on a Unitxt task definition.

    The common configuration below is filled with the specific metrics for the
    task. It still leaves 'dataset_name' and 'task' unspecified; those are set
    by the per-card YAML files that include this one.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML base file for task {task}")
    print("*")
    task_definition, _ = fetch_artifact(task)
    data = {
        "group": ["unitxt"],
        "dataset_path": "unitxt/data",
        "output_type": "generate_until",
        "training_split": "train",
        "validation_split": "test",
        "doc_to_text": "{{source}}",
        "doc_to_target": "target",
        "process_results": unitxt_wrapper.process_results,
        "generation_kwargs": {"until": ["</s>"]},
        "metric_list": [],
        "metadata": {"version": 1.0},
    }
    # Map each Unitxt metric name (metrics.<name>) to its registered unitxt_<name> counterpart.
    for metric_name in task_definition.metrics:
        new_metric = {"metric": "", "aggregation": "unitxt", "higher_is_better": True}
        new_metric["metric"] = metric_name.replace("metrics.", "unitxt_")
        data["metric_list"].append(new_metric)
    write_task_yaml(f"unitxt_{task}", data)

def generate_card_yaml(card: str):
    """
    Generate an LM Eval Harness YAML file based on the Unitxt dataset card.

    It includes the task YAML for the dataset and overrides 'dataset_name' and
    'task' with the card-specific values.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML file for unitxt dataset {card}")
    print("*")
    card_definition, _ = fetch_artifact(f"cards.{card}")
    task = card_definition.task.__id__
    if task in default_template_per_task:
        template = default_template_per_task[task]
    else:
        raise ValueError(
            f"Default template was not defined for task {task} in 'default_template_per_task' dict in generate_yamls.py"
        )
    data = {}
    data["include"] = f"unitxt_{task}"
    data["task"] = card
    data["dataset_name"] = f"card=cards.{card},template={template}"
    # Sanity-check that the dataset loads; this is faster than the load_dataset approach:
    # dataset = load_dataset('unitxt/data', data["dataset_name"]+",loader_limit=100",trust_remote_code=True)
    recipe = StandardRecipe(card=f"cards.{card}", template=template, loader_limit=100)
    stream = recipe()
    dataset = stream.to_dataset()
    print(dataset)
    print("Sample input:")
    print(dataset["test"][0]["source"])
    print("Sample output:")
    print(dataset["test"][0]["target"])
    write_card_yaml(f"{card}.yaml", data)

def main():
    for task in default_template_per_task.keys():
        try:
            generate_task_yaml(task)
        except Exception as e:
            print(f"Unable to generate YAML for {task} due to:")
            print(e)
            raise e
    with open("unitxt_datasets") as f:
        for unitxt_dataset in f:
            unitxt_dataset = unitxt_dataset.strip()
            if unitxt_dataset.startswith("### END ###"):
                exit(0)
            if not unitxt_dataset.startswith("#"):
                try:
                    generate_card_yaml(unitxt_dataset)
                except Exception as e:
                    print(f"Unable to generate YAML for {unitxt_dataset} due to:")
                    print(e)
                    raise e


if __name__ == "__main__":
    main()
include: unitxt_tasks.classification.multi_class
task: law_stack_exchange
dataset_name: card=cards.law_stack_exchange,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ledgar
dataset_name: card=cards.ledgar,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: medical_abstracts
dataset_name: card=cards.medical_abstracts,template=templates.classification.multi_class.title
include: unitxt_tasks.regression.two_texts
task: stsb
dataset_name: card=cards.stsb,template=templates.regression.two_texts.simple
include: unitxt_tasks.classification.multi_label
task: unfair_tos
dataset_name: card=cards.unfair_tos,template=templates.classification.multi_label.title
coedit_gec
atis
20_newsgroups
ag_news
argument_topic
banking77
claim_stance_topic
cnn_dailymail
dbpedia_14
ethos_binary
financial_tweets
law_stack_exchange
ledgar
medical_abstracts
stsb
unfair_tos
xsum
yahoo_answers_topics