Unverified Commit 885f48d6 authored by Yoav Katz, committed by GitHub

Initial integration of Unitxt into the LM eval harness (#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt



The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.
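The wrapper itself is not shown in this diff. Purely as an illustrative sketch of the shape such glue code can take (hypothetical names and placeholder scoring, not the actual file contents), the generated task YAMLs expect a `process_results(doc, results)` hook plus a `unitxt` aggregation over the values it returns:

```
# Hypothetical illustration only -- not the contents of unitxt_wrapper.py.
def process_results(doc, results):
    # 'doc' is one row of the unitxt/data dataset; 'results' holds the model generation.
    prediction = results[0]
    # Defer real scoring to aggregation time by returning the raw pair under the
    # metric keys (unitxt_<metric>) declared in the generated metric_list.
    return {"unitxt_placeholder_metric": (prediction, doc["target"])}


def agg_unitxt(items):
    # 'items' collects the per-document values; a real implementation would hand
    # them to the corresponding Unitxt metric. Exact match is only a stand-in here.
    return sum(pred == ref for pred, ref in items) / len(items)
```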

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if it is not (see the sketch below)
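A minimal sketch of the kind of check described in item 2, assuming only that the Unitxt task code needs the `unitxt` package at import time (the exact wording and placement in the codebase may differ):

```
try:
    import unitxt  # noqa: F401
except ImportError as err:
    raise ImportError(
        "Unitxt tasks require the 'unitxt' package. "
        "Install it with: pip install 'lm_eval[unitxt]'"
    ) from err
```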

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent d4a913c4
@@ -443,6 +443,7 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
| sentencepiece | For using the sentencepiece tokenizer |
| sparseml | For using NM's SparseML models |
| testing | For running library test suite |
| unitxt | For IBM's unitxt dataset tasks |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
include: unitxt_tasks.classification.multi_class
task: 20_newsgroups
dataset_name: card=cards.20_newsgroups,template=templates.classification.multi_class.title
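The `dataset_name` string above is passed to the `unitxt/data` dataset loader on the HuggingFace Hub. Following the commented-out call in `generate_yamls.py`, you can load a capped sample of the same data directly; this sketch assumes the `datasets` package and network access, and the `source`/`target` field names follow the sample printing in `generate_yamls.py`:

```
from datasets import load_dataset

# Load a capped sample of the card/template combination referenced by the YAML above.
dataset = load_dataset(
    "unitxt/data",
    "card=cards.20_newsgroups,template=templates.classification.multi_class.title,loader_limit=100",
    trust_remote_code=True,
)
print(dataset["test"][0]["source"])  # rendered prompt
print(dataset["test"][0]["target"])  # gold answer
```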
# Unitxt
### Paper
Title: `Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI`
Abstract: `https://arxiv.org/abs/2401.14019`
Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.
The full Unitxt catalog can be viewed in an online explorer. `https://unitxt.readthedocs.io/en/latest/docs/demo.html`
Homepage: https://unitxt.readthedocs.io/en/latest/index.html
### Citation
```
@misc{unitxt,
title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
year={2024},
eprint={2401.14019},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `unitxt`: A subset of Unitxt tasks that were not previously in the LM-Eval Harness task catalog, including new task types such as multi-label classification, grammatical error correction, and named entity extraction.
#### Tasks
The full list of Unitxt tasks currently supported can be seen under the `tasks/unitxt` directory.
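As a quick end-to-end check, one of these tasks can be run through the usual harness entry points once the `unitxt` extra is installed. A sketch using the Python API, where the model choice (`pretrained=gpt2`) and the `limit` value are placeholder assumptions:

```
import lm_eval

# Evaluate one generated Unitxt task with a small HF model (placeholder choices).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["20_newsgroups"],
    limit=10,  # cap the number of documents for a quick smoke test
)
print(results["results"])
```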
### Adding tasks
You can add additional tasks from the Unitxt catalog by generating new LM-Eval yaml files for these datasets.
The Unitxt task yaml files are generated via the `generate_yamls.py` script in the `tasks/unitxt` directory.
To add a yaml file for an existing Unitxt dataset that is not yet in LM-Eval:
1. Add the card name to the `unitxt_datasets` file in the `tasks/unitxt` directory.
2. The `generate_yamls.py` script contains the default Unitxt [template](https://unitxt.readthedocs.io/en/latest/docs/adding_template.html) used for each kind of NLP task in its `default_template_per_task` dictionary. If the dataset is of a Unitxt task type not previously used in LM-Eval, you will need to add a default template for it to the dictionary (see the snippet after this list for checking a card's task type).
```
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}
```
3. Run `python generate_yamls.py` (this will generate yaml files for all the datasets listed in the `unitxt_datasets` file).
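If you are unsure which Unitxt task type a card maps to (and hence whether a default template already exists for it), you can query the card directly, mirroring the lookup `generate_yamls.py` performs internally; this assumes the `unitxt` package is installed and uses the `20_newsgroups` card from the examples above:

```
from unitxt.artifact import fetch_artifact

# Resolve the card and inspect its task type, mirroring the lookup in generate_yamls.py.
card_definition, _ = fetch_artifact("cards.20_newsgroups")
print(card_definition.task.__id__)  # e.g. "tasks.classification.multi_class"
```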
If you want to add a new dataset to the Unitxt catalog, see the Unitxt documentation:
https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: argument_topic
dataset_name: card=cards.argument_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.span_labeling.extraction
task: atis
dataset_name: card=cards.atis,template=templates.span_labeling.extraction.title
include: unitxt_tasks.classification.multi_class
task: banking77
dataset_name: card=cards.banking77,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: claim_stance_topic
dataset_name: card=cards.claim_stance_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.summarization.abstractive
task: cnn_dailymail
dataset_name: card=cards.cnn_dailymail,template=templates.summarization.abstractive.full
include: unitxt_tasks.grammatical_error_correction
task: coedit_gec
dataset_name: card=cards.coedit_gec,template=templates.grammatical_error_correction.simple
include: unitxt_tasks.classification.multi_class
task: dbpedia_14
dataset_name: card=cards.dbpedia_14,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ethos_binary
dataset_name: card=cards.ethos_binary,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: financial_tweets
dataset_name: card=cards.financial_tweets,template=templates.classification.multi_class.title
#
# This file generates a set of LM Eval Harness yaml files
# that load unitxt datasets (https://github.com/IBM/unitxt)
#
import unitxt_wrapper
import yaml
from unitxt.artifact import fetch_artifact
from unitxt.standard import StandardRecipe
# This code is required to properly dump LM harness YAML that contains references to functions
def function_representer(dumper: yaml.SafeDumper, func) -> yaml.nodes.ScalarNode:
    return dumper.represent_scalar(
        "!function", f"{func.__module__}.{func.__name__}", style=None
    )


def write_task_yaml(filename, data):
    yaml.add_representer(type(data["process_results"]), function_representer)
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)


def write_card_yaml(filename, data):
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)

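# Default Unitxt template to use for each supported Unitxt task type; cards whose
# task type is missing from this mapping cause generate_card_yaml() to raise.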
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}

def generate_task_yaml(task: str):
    """
    Generate an LM Eval Harness YAML file based on a Unitxt task definition.

    The common configuration below is filled with the specific metrics for the
    task. It still leaves 'dataset_name' and 'task' unspecified; those are set
    by the per-card YAML files that include this one.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML base file for task {task}")
    print("*")
    task_definition, _ = fetch_artifact(task)
    data = {
        "group": ["unitxt"],
        "dataset_path": "unitxt/data",
        "output_type": "generate_until",
        "training_split": "train",
        "validation_split": "test",
        "doc_to_text": "{{source}}",
        "doc_to_target": "target",
        "process_results": unitxt_wrapper.process_results,
        "generation_kwargs": {"until": ["</s>"]},
        "metric_list": [],
        "metadata": {"version": 1.0},
    }
    # Map each Unitxt metric name (metrics.<name>) to its registered unitxt_<name> counterpart.
    for metric_name in task_definition.metrics:
        new_metric = {"metric": "", "aggregation": "unitxt", "higher_is_better": True}
        new_metric["metric"] = metric_name.replace("metrics.", "unitxt_")
        data["metric_list"].append(new_metric)
    write_task_yaml(f"unitxt_{task}", data)

def generate_card_yaml(card: str):
    """
    Generate an LM Eval Harness YAML file based on the Unitxt dataset card.

    It includes the task YAML for the dataset and overrides 'dataset_name' and
    'task' with the card-specific values.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML file for unitxt dataset {card}")
    print("*")
    card_definition, _ = fetch_artifact(f"cards.{card}")
    task = card_definition.task.__id__
    if task in default_template_per_task:
        template = default_template_per_task[task]
    else:
        raise ValueError(
            f"Default template was not defined for task {task} in 'default_template_per_task' dict in generate_yamls.py"
        )
    data = {}
    data["include"] = f"unitxt_{task}"
    data["task"] = card
    data["dataset_name"] = f"card=cards.{card},template={template}"
    # Sanity-check that the dataset loads; this is faster than the load_dataset approach:
    # dataset = load_dataset('unitxt/data', data["dataset_name"]+",loader_limit=100",trust_remote_code=True)
    recipe = StandardRecipe(card=f"cards.{card}", template=template, loader_limit=100)
    stream = recipe()
    dataset = stream.to_dataset()
    print(dataset)
    print("Sample input:")
    print(dataset["test"][0]["source"])
    print("Sample output:")
    print(dataset["test"][0]["target"])
    write_card_yaml(f"{card}.yaml", data)

def main():
    for task in default_template_per_task.keys():
        try:
            generate_task_yaml(task)
        except Exception as e:
            print(f"Unable to generate YAML for {task} due to:")
            print(e)
            raise e
    with open("unitxt_datasets") as f:
        for unitxt_dataset in f:
            unitxt_dataset = unitxt_dataset.strip()
            if unitxt_dataset.startswith("### END ###"):
                exit(0)
            if not unitxt_dataset.startswith("#"):
                try:
                    generate_card_yaml(unitxt_dataset)
                except Exception as e:
                    print(f"Unable to generate YAML for {unitxt_dataset} due to:")
                    print(e)
                    raise e


if __name__ == "__main__":
    main()
include: unitxt_tasks.classification.multi_class
task: law_stack_exchange
dataset_name: card=cards.law_stack_exchange,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ledgar
dataset_name: card=cards.ledgar,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: medical_abstracts
dataset_name: card=cards.medical_abstracts,template=templates.classification.multi_class.title
include: unitxt_tasks.regression.two_texts
task: stsb
dataset_name: card=cards.stsb,template=templates.regression.two_texts.simple
include: unitxt_tasks.classification.multi_label
task: unfair_tos
dataset_name: card=cards.unfair_tos,template=templates.classification.multi_label.title
coedit_gec
atis
20_newsgroups
ag_news
argument_topic
banking77
claim_stance_topic
cnn_dailymail
dbpedia_14
ethos_binary
financial_tweets
law_stack_exchange
ledgar
medical_abstracts
stsb
unfair_tos
xsum
yahoo_answers_topics