Commit 60c9c170 authored by haileyschoelkopf

Merge branch 'main' into inverse-scaling-tasks

parents 4b2d565b b4cd85d4
subject name category
dentistry 牙醫學 health
traditional_chinese_medicine_clinical_medicine 中醫臨床醫學 health
clinical_psychology 臨床心理學 psychology
technical 技術工相關 other
culinary_skills 餐旅 other
mechanical 機械與機電概論 other
logic_reasoning 邏輯思維 other
real_estate 房地產 other
general_principles_of_law 法學大意 law
finance_banking 金融與法規 business
anti_money_laundering 洗錢防制 law
ttqav2 台灣在地用語 culture
marketing_management 行銷管理 other
business_management 企業管理 other
organic_chemistry 有機化學 chemistry
advance_chemistry 化學 chemistry
physics 物理 physics
secondary_physics 高中物理 physics
human_behavior 人類行為與社會 psychology
national_protection 軍事 politics
jce_humanities 指考人文科目 philosophy
linear_algebra 線代 math
politic_science 政治 politics
agriculture 農業 other
official_document_management 機關文書 other
financial_analysis 財務分析 business
pharmacy 藥劑學 biology
educational_psychology 教育心理 psychology
statistics_and_machine_learning 統計與機器學習 engineering
management_accounting 管理會計 business
introduction_to_law 法律概論 law
computer_science 資訊工程 computer science
veterinary_pathology 獸醫病理學 health
accounting 會計學 business
fire_science 火災學 other
optometry 視光學 other
insurance_studies 保險學 other
pharmacology 藥理學 health
taxation 稅務 law
education_(profession_level) 教育專業 education
economics 經濟學 economics
veterinary_pharmacology 獸醫藥理學 health
nautical_science 航海 other
occupational_therapy_for_psychological_disorders 心理障礙職能治療學 psychology
trust_practice 信託實務 law
geography_of_taiwan 台灣地理 geography
physical_education 體育 education
auditing 審計學 business
administrative_law 行政法 law
basic_medical_science 基礎醫學 biology
macroeconomics 總經 economics
trade 貿易 business
chinese_language_and_literature 國文 culture
tve_design 統測_設計 other
junior_science_exam 國中會考基測自然科 biology
junior_math_exam 國中會考基測數學科 math
junior_chinese_exam 國中會考基測國文 culture
junior_social_studies 國中會考基測社會科 other
tve_mathematics 統測數學 math
tve_chinese_language 統測國文 culture
tve_natural_sciences 統測自然科 biology
junior_chemistry 國中理化 chemistry
music 音樂科 other
education 教育常識 education
three_principles_of_people 三民主義 culture
taiwanese_hokkien 閩南語 culture
engineering_math 工程數學 math
include: unitxt_tasks.classification.multi_class
task: 20_newsgroups
dataset_name: card=cards.20_newsgroups,template=templates.classification.multi_class.title
# Unitxt
### Paper
Title: `Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI`
Abstract: `https://arxiv.org/abs/2401.14019`
Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.
The full Unitxt catalog can be viewed in an online explorer. `https://unitxt.readthedocs.io/en/latest/docs/demo.html`
Homepage: https://unitxt.readthedocs.io/en/latest/index.html
### Citation
```
@misc{unitxt,
title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
year={2024},
eprint={2401.14019},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `unitxt`: A subset of Unitxt tasks that were not previously in the LM-Eval Harness task catalog, including new task types such as multi-label classification, grammatical error correction, and named entity extraction.
#### Tasks
The full list of currently supported Unitxt tasks can be seen under the `tasks/unitxt` directory.
### Adding tasks
You can add additional tasks from the Unitxt catalog by generating new LM-Eval yaml files for these datasets.
The Unitxt task yaml files are generated via the `generate_yamls.py` script in the `tasks/unitxt` directory.
To add a yaml file for an existing Unitxt dataset which is not yet in LM-Eval:
1. Add the card name to the `unitxt_datasets` file in the `tasks/unitxt` directory.
2. The `generate_yamls.py` script contains the default Unitxt [template](https://unitxt.readthedocs.io/en/latest/docs/adding_template.html) used for each kind of NLP task in the `default_template_per_task` dictionary. If the dataset is of a Unitxt task type not previously used in LM-Eval, you will need to add a default template for it to the dictionary.
```
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}
```
3. Run `python generate_yamls.py` (this will generate yaml files for all the datasets listed in `unitxt_datasets`).
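For example, the `ag_news` card (already listed in `unitxt_datasets`) produces a small card yaml that just includes the shared task file and overrides the card-specific fields:

```yaml
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title
```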
If you want to add a new dataset to the Unitxt catalog, see the Unitxt documentation:
https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: argument_topic
dataset_name: card=cards.argument_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.span_labeling.extraction
task: atis
dataset_name: card=cards.atis,template=templates.span_labeling.extraction.title
include: unitxt_tasks.classification.multi_class
task: banking77
dataset_name: card=cards.banking77,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: claim_stance_topic
dataset_name: card=cards.claim_stance_topic,template=templates.classification.multi_class.title
include: unitxt_tasks.summarization.abstractive
task: cnn_dailymail
dataset_name: card=cards.cnn_dailymail,template=templates.summarization.abstractive.full
include: unitxt_tasks.grammatical_error_correction
task: coedit_gec
dataset_name: card=cards.coedit_gec,template=templates.grammatical_error_correction.simple
include: unitxt_tasks.classification.multi_class
task: dbpedia_14
dataset_name: card=cards.dbpedia_14,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ethos_binary
dataset_name: card=cards.ethos_binary,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: financial_tweets
dataset_name: card=cards.financial_tweets,template=templates.classification.multi_class.title
#
# This file generates a set of LM eval harness yaml files
# that load unitxt datasets (https://github.com/IBM/unitxt)
#
import unitxt_wrapper
import yaml
from unitxt.artifact import fetch_artifact
from unitxt.standard import StandardRecipe
# This code is required to properly dump LM harness YAML that contains references to functions
def function_representer(dumper: yaml.SafeDumper, func) -> yaml.nodes.ScalarNode:
    return dumper.represent_scalar(
        "!function", f"{func.__module__}.{func.__name__}", style=None
    )


def write_task_yaml(filename, data):
    yaml.add_representer(type(data["process_results"]), function_representer)
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)


def write_card_yaml(filename, data):
    with open(filename, "w") as stream:
        yaml.dump(data, stream, sort_keys=False)
default_template_per_task = {
    "tasks.classification.multi_label": "templates.classification.multi_label.title",
    "tasks.classification.multi_class": "templates.classification.multi_class.title",
    "tasks.summarization.abstractive": "templates.summarization.abstractive.full",
    "tasks.regression.two_texts": "templates.regression.two_texts.simple",
    "tasks.qa.with_context.extractive": "templates.qa.with_context.simple",
    "tasks.grammatical_error_correction": "templates.grammatical_error_correction.simple",
    "tasks.span_labeling.extraction": "templates.span_labeling.extraction.title",
}
def generate_task_yaml(task: str):
    """
    Generate an LM Eval Harness YAML file based on a Unitxt task definition.

    The output YAML is based on 'template.yaml.file' found in the current directory.
    The common template is filled in with the specific metrics for the task.
    It still leaves the 'dataset_name' and 'task name' unspecified.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML base file for task {task}")
    print("*")
    task_definition, _ = fetch_artifact(task)
    data = {
        "group": ["unitxt"],
        "dataset_path": "unitxt/data",
        "output_type": "generate_until",
        "training_split": "train",
        "validation_split": "test",
        "doc_to_text": "{{source}}",
        "doc_to_target": "target",
        "process_results": unitxt_wrapper.process_results,
        "generation_kwargs": {"until": ["</s>"]},
        "metric_list": [],
        "metadata": {"version": 1.0},
    }
    for metric_name in task_definition.metrics:
        new_metric = {"metric": "", "aggregation": "unitxt", "higher_is_better": True}
        new_metric["metric"] = metric_name.replace("metrics.", "unitxt_")
        data["metric_list"].append(new_metric)
    write_task_yaml(f"unitxt_{task}", data)
def generate_card_yaml(card: str):
    """
    Generate an LM Eval Harness YAML file based on the Unitxt dataset card.

    It includes the task YAML for the dataset, and overrides the 'dataset_name' and 'task' with the card.
    """
    print("*" * 80)
    print("*")
    print(f"* Generating YAML file for unitxt dataset {card}")
    print("*")
    card_definition, _ = fetch_artifact(f"cards.{card}")
    task = card_definition.task.__id__
    if task in default_template_per_task:
        template = default_template_per_task[task]
    else:
        raise ValueError(
            f"Default template was not defined for task {task} in 'default_template_per_task' dict in generate_yamls.py"
        )
    data = {}
    data["include"] = f"unitxt_{task}"
    data["task"] = card
    data["dataset_name"] = f"card=cards.{card},template={template}"
    # This is faster than the load_dataset approach
    # dataset = load_dataset('unitxt/data', data["dataset_name"]+",loader_limit=100",trust_remote_code=True)
    recipe = StandardRecipe(card=f"cards.{card}", template=template, loader_limit=100)
    stream = recipe()
    dataset = stream.to_dataset()
    print(dataset)
    print("Sample input:")
    print(dataset["test"][0]["source"])
    print("Sample output:")
    print(dataset["test"][0]["target"])
    write_card_yaml(f"{card}.yaml", data)
def main():
    for task in default_template_per_task.keys():
        try:
            generate_task_yaml(task)
        except Exception as e:
            print(f"Unable to generate YAML for {task} due to:")
            print(e)
            raise e
    with open("unitxt_datasets") as f:
        for unitxt_dataset in f:
            unitxt_dataset = unitxt_dataset.strip()
            if unitxt_dataset.startswith("### END ###"):
                exit(0)
            if not unitxt_dataset.startswith("#"):
                try:
                    generate_card_yaml(unitxt_dataset)
                except Exception as e:
                    print(f"Unable to generate YAML for {unitxt_dataset} due to:")
                    print(e)
                    raise e


if __name__ == "__main__":
    main()
include: unitxt_tasks.classification.multi_class
task: law_stack_exchange
dataset_name: card=cards.law_stack_exchange,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: ledgar
dataset_name: card=cards.ledgar,template=templates.classification.multi_class.title
include: unitxt_tasks.classification.multi_class
task: medical_abstracts
dataset_name: card=cards.medical_abstracts,template=templates.classification.multi_class.title
include: unitxt_tasks.regression.two_texts
task: stsb
dataset_name: card=cards.stsb,template=templates.regression.two_texts.simple
include: unitxt_tasks.classification.multi_label
task: unfair_tos
dataset_name: card=cards.unfair_tos,template=templates.classification.multi_label.title
coedit_gec
atis
20_newsgroups
ag_news
argument_topic
banking77
claim_stance_topic
cnn_dailymail
dbpedia_14
ethos_binary
financial_tweets
law_stack_exchange
ledgar
medical_abstracts
stsb
unfair_tos
xsum
yahoo_answers_topics