Commit 60c9c170 authored by haileyschoelkopf

Merge branch 'main' into inverse-scaling-tasks

parents 4b2d565b b4cd85d4
subject name category
dentistry 牙醫學 health
traditional_chinese_medicine_clinical_medicine 中醫臨床醫學 health
clinical_psychology 臨床心理學 psychology
technical 技術工相關 other
culinary_skills 餐旅 other
mechanical 機械與機電概論 other
logic_reasoning 邏輯思維 other
real_estate 房地產 other
general_principles_of_law 法學大意 law
finance_banking 金融與法規 business
anti_money_laundering 洗錢防制 law
ttqav2 台灣在地用語 culture
marketing_management 行銷管理 other
business_management 企業管理 other
organic_chemistry 有機化學 chemistry
advance_chemistry 化學 chemistry
physics 物理 physics
secondary_physics 高中物理 physics
human_behavior 人類行為與社會 psychology
national_protection 軍事 politics
jce_humanities 指考人文科目 philosophy
linear_algebra 線代 math
politic_science 政治 politics
agriculture 農業 other
official_document_management 機關文書 other
financial_analysis 財務分析 business
pharmacy 藥劑學 biology
educational_psychology 教育心理 psychology
statistics_and_machine_learning 統計與機器學習 engineering
management_accounting 管理會計 business
introduction_to_law 法律概論 law
computer_science 資訊工程 computer science
veterinary_pathology 獸醫病理學 health
accounting 會計學 business
fire_science 火災學 other
optometry 視光學 other
insurance_studies 保險學 other
pharmacology 藥理學 health
taxation 稅務 law
education_(profession_level) 教育專業 education
economics 經濟學 economics
veterinary_pharmacology 獸醫藥理學 health
nautical_science 航海 other
occupational_therapy_for_psychological_disorders 心理障礙職能治療學 psychology
trust_practice 信託實務 law
geography_of_taiwan 台灣地理 geography
physical_education 體育 education
auditing 審計學 business
administrative_law 行政法 law
basic_medical_science 基礎醫學 biology
macroeconomics 總經 economics
trade 貿易 business
chinese_language_and_literature 國文 culture
tve_design 統測_設計 other
junior_science_exam 國中會考基測自然科 biology
junior_math_exam 國中會考基測數學科 math
junior_chinese_exam 國中會考基測國文 culture
junior_social_studies 國中會考基測社會科 other
tve_mathematics 統測數學 math
tve_chinese_language 統測國文 culture
tve_natural_sciences 統測自然科 biology
junior_chemistry 國中理化 chemistry
music 音樂科 other
education 教育常識 education
three_principles_of_people 三民主義 culture
taiwanese_hokkien 閩南語 culture
engineering_math 工程數學 math
include: unitxt_tasks.classification.multi_class
task: 20_newsgroups
dataset_name: card=cards.20_newsgroups,template=templates.classification.multi_class.title
# Unitxt
### Paper
Title: `Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI`
Abstract: `https://arxiv.org/abs/2401.14019`
Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.
The full Unitxt catalog can be viewed in an online explorer. `https://unitxt.readthedocs.io/en/latest/docs/demo.html`
Homepage: https://unitxt.readthedocs.io/en/latest/index.html
### Citation
```
@misc{unitxt,
title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
year={2024},
eprint={2401.14019},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `unitxt`: A subset of Unitxt tasks that were not previously in the LM-Eval Harness task catalog, including new task types such as multi-label classification, grammatical error correction, and named entity extraction.
#### Tasks
The full list of currently supported Unitxt tasks can be seen under the `tasks/unitxt` directory.
### Adding tasks
You can add additional tasks from the Unitxt catalog by generating new LM-Eval yaml files for these datasets.
The Unitxt task yaml files are generated via the `generate_yamls.py` script in the `tasks/unitxt` directory.
To add a yaml file for an existing Unitxt dataset which is not yet in LM-Eval:
1. Add the card name to the `unitxt_datasets` file in the `tasks/unitxt` directory.
2. The `generate_yamls.py` script contains the default Unitxt [template](https://unitxt.readthedocs.io/en/latest/docs/adding_template.html) used for each kind of NLP task in the `default_template_per_task` dictionary. If the dataset is of a Unitxt task type not previously used in LM-Eval, you will need to add a default template for it to the dictionary.
```
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}
```
3. Run `python generate_yamls.py` (this will generate yaml files for all the datasets listed in `unitxt_datasets`).
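For example, the `ag_news` card (already listed in `unitxt_datasets`) produces a small card yaml that just includes the shared task file and overrides the card-specific fields:

```yaml
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title
```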
If you want to add a new dataset to the Unitxt catalog, see the Unitxt documentation:
https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: argument_topic
dataset_name: card=cards.argument_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.span_labeling.extraction
task: atis
dataset_name: card=cards.atis,template=templates.span_labeling.extraction.title
include: unitxt_tasks.classification.multi_class
task: banking77
dataset_name: card=cards.banking77,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: claim_stance_topic
dataset_name: card=cards.claim_stance_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.summarization.abstractive
task: cnn_dailymail
dataset_name: card=cards.cnn_dailymail,template=templates.summarization.abstractive.full
include: unitxt_tasks.grammatical_error_correction
task: coedit_gec
dataset_name: card=cards.coedit_gec,template=templates.grammatical_error_correction.simple
include: unitxt_tasks.classification.multi_class
task: dbpedia_14
dataset_name: card=cards.dbpedia_14,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ethos_binary
dataset_name: card=cards.ethos_binary,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: financial_tweets
dataset_name: card=cards.financial_tweets,template=templates.classification.multi_class.title
#
# This file generates a set of LM eval harness yaml files
# that load unitxt datasets (https://github.com/IBM/unitxt)
#
import unitxt_wrapper
import yaml
from unitxt.artifact import fetch_artifact
from unitxt.standard import StandardRecipe
# This code is required to properly dump LM harness YAML that contains references to functions
def function_representer(dumper: yaml.SafeDumper, func) -> yaml.nodes.ScalarNode:
    return dumper.represent_scalar(
        "!function", f"{func.__module__}.{func.__name__}", style=None
    )


def write_task_yaml(filename, data):
    yaml.add_representer(type(data["process_results"]), function_representer)
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)


def write_card_yaml(filename, data):
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}
def generate_task_yaml(task: str):
    """
    Generate an LM Eval Harness YAML file based on a Unitxt task definition.

    The output YAML is based on 'template.yaml.file' found in the current directory.
    The common template is filled in with the specific metrics for the task.
    It still leaves the 'dataset_name' and 'task name' unspecified.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML base file for task {task}")
    print("*")
    task_definition, _ = fetch_artifact(task)
    data = {
        "group": ["unitxt"],
        "dataset_path": "unitxt/data",
        "output_type": "generate_until",
        "training_split": "train",
        "validation_split": "test",
        "doc_to_text": "{{source}}",
        "doc_to_target": "target",
        "process_results": unitxt_wrapper.process_results,
        "generation_kwargs": {"until": ["</s>"]},
        "metric_list": [],
        "metadata": {"version": 1.0},
    }
    for metric_name in task_definition.metrics:
        new_metric = {"metric": "", "aggregation": "unitxt", "higher_is_better": True}
        new_metric["metric"] = metric_name.replace("metrics.", "unitxt_")
        data["metric_list"].append(new_metric)
    write_task_yaml(f"unitxt_{task}", data)
def generate_card_yaml(card: str):
    """
    Generate an LM Eval Harness YAML file based on the Unitxt dataset card.

    It includes the task YAML for the dataset, and overrides the 'dataset_name' and 'task' with the card.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML file for unitxt dataset {card}")
    print("*")
    card_definition, _ = fetch_artifact(f"cards.{card}")
    task = card_definition.task.__id__
    if task in default_template_per_task:
        template = default_template_per_task[task]
    else:
        raise ValueError(
            f"Default template was not defined for task {task} in 'default_template_per_task' dict in generate_yamls.py"
        )
    data = {}
    data["include"] = f"unitxt_{task}"
    data["task"] = card
    data["dataset_name"] = f"card=cards.{card},template={template}"
    # This is faster than the load_dataset approach
    # dataset = load_dataset('unitxt/data', data["dataset_name"]+",loader_limit=100",trust_remote_code=True)
    recipe = StandardRecipe(card=f"cards.{card}", template=template, loader_limit=100)
    stream = recipe()
    dataset = stream.to_dataset()
    print(dataset)
    print("Sample input:")
    print(dataset["test"][0]["source"])
    print("Sample output:")
    print(dataset["test"][0]["target"])
    write_card_yaml(f"{card}.yaml", data)
def main():
    for task in default_template_per_task.keys():
        try:
            generate_task_yaml(task)
        except Exception as e:
            print(f"Unable to generate YAML for {task} due to:")
            print(e)
            raise e
    with open("unitxt_datasets") as f:
        for unitxt_dataset in f:
            unitxt_dataset = unitxt_dataset.strip()
            if unitxt_dataset.startswith("### END ###"):
                exit(0)
            if not unitxt_dataset.startswith("#"):
                try:
                    generate_card_yaml(unitxt_dataset)
                except Exception as e:
                    print(f"Unable to generate YAML for {unitxt_dataset} due to:")
                    print(e)
                    raise e


if __name__ == "__main__":
    main()
include: unitxt_tasks.classification.multi_class
task: law_stack_exchange
dataset_name: card=cards.law_stack_exchange,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ledgar
dataset_name: card=cards.ledgar,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: medical_abstracts
dataset_name: card=cards.medical_abstracts,template=templates.classification.multi_class.title
include: unitxt_tasks.regression.two_texts
task: stsb
dataset_name: card=cards.stsb,template=templates.regression.two_texts.simple
include: unitxt_tasks.classification.multi_label
task: unfair_tos
dataset_name: card=cards.unfair_tos,template=templates.classification.multi_label.title
coedit_gec
atis
20_newsgroups
ag_news
argument_topic
banking77
claim_stance_topic
cnn_dailymail
dbpedia_14
ethos_binary
financial_tweets
law_stack_exchange
ledgar
medical_abstracts
stsb
unfair_tos
xsum
yahoo_answers_topics