Commit d1c189ea authored by lintangsutawika

remove scrolls from this PR

parent 1f351067
# SCROLLS
### Paper
Title: `SCROLLS: Standardized CompaRison Over Long Language Sequences`
Abstract: https://arxiv.org/abs/2201.03533
SCROLLS is a suite of datasets that require synthesizing information over long texts.
The benchmark includes seven natural language tasks across multiple domains,
including summarization, question answering, and natural language inference.
Homepage: https://www.scrolls-benchmark.com/
Since SCROLLS tasks are generally longer than the maximum sequence length of many models,
it is possible to create "subset" tasks that contain only those samples whose tokenized length
is less than some pre-defined limit. For example, to create a subset of "Qasper" that would
be suitable for a model using the GPTNeoX tokenizer and a 4K maximum sequence length:
```
class QasperGPTNeoX4K(Qasper):
    PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped"]
    PRUNE_MAX_TOKENS = 4096
    PRUNE_NUM_PROC = _num_cpu_cores()  # optional; speeds up pruning of large datasets like NarrativeQA
```
`PRUNE_TOKENIZERS` can contain more than one tokenizer; in that case only samples whose tokenized length
is less than `PRUNE_MAX_TOKENS` for ALL of the tokenizers are kept. This can be useful for comparing models
that use different tokenizers but the same maximum sequence length.
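For example (a sketch; the class name below is illustrative, following the example above), a subset whose samples fit a 4K context under both the Pythia (GPT-NeoX) and GPT-2 tokenizers could be defined as:
```
class QasperMultiTokenizer4K(Qasper):
    PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped", "gpt2"]
    PRUNE_MAX_TOKENS = 4096
```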
Once the subset task class has been defined in this file, it can be used by adding the class
to `lm_eval/tasks/__init__.py`.
NOTE: GovReport may need `max_gen_toks` set larger for causal models.
### Citation
```
@inproceedings{shaham-etal-2022-scrolls,
    title = "{SCROLLS}: Standardized {C}ompa{R}ison Over Long Language Sequences",
    author = "Shaham, Uri and
      Segal, Elad and
      Ivgi, Maor and
      Efrat, Avia and
      Yoran, Ori and
      Haviv, Adi and
      Gupta, Ankit and
      Xiong, Wenhan and
      Geva, Mor and
      Berant, Jonathan and
      Levy, Omer",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.823",
    pages = "12007--12021"
}
```
### Groups and Tasks
#### Groups
* `scrolls`: executes all SCROLLS tasks in this folder, including both `scrolls_qasper_boolean` and `scrolls_qasper_freeform`
#### Tasks
* `scrolls_qasper_boolean`: Multiple choice task that evaluates the Qasper samples with `answer_type="bool"`
* `scrolls_qasper_freeform`: Greedy generation task that evaluates the Qasper samples with `answer_type="free form answer"`
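The tasks above can also be run programmatically. The snippet below is a sketch that assumes the harness exposes the `lm_eval.simple_evaluate` entry point and uses a small HF model; adjust the model arguments to your setup.
```
import lm_eval

# Sketch: evaluate both Qasper variants with a small HuggingFace model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-410m-deduped",
    tasks=["scrolls_qasper_boolean", "scrolls_qasper_freeform"],
)
print(results["results"])
```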
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: ../scroll_multiplechoice_task_yaml
task: scrolls_contractnli
dataset_name: contract_nli
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nHypothesis: {{question}}\nConclusion:"
doc_to_target: "{{outputs[0]}}"
doc_to_choice: ["Not mentioned", "Entailment", "Contradiction"]
should_decontaminate: true
doc_to_decontamination_query: input
include: ../scroll_summary_task_yaml
task: scrolls_govreport
dataset_path: tau/scrolls
dataset_name: gov_report
import evaluate

rouge_fn = evaluate.load('rouge')


def rouge1(predictions, references):
    results = rouge_fn.compute(predictions=predictions, references=references)
    return results['rouge1']


def rouge2(predictions, references):
    results = rouge_fn.compute(predictions=predictions, references=references)
    return results['rouge2']


def rougeL(predictions, references):
    results = rouge_fn.compute(predictions=predictions, references=references)
    return results['rougeL']


squad_metric = evaluate.load("squad_v2")


def agg_f1(samples):
    predictions, references = zip(*samples)  # unzip, if you will
    computed = squad_metric.compute(predictions=predictions, references=references)
    return computed["f1"]
def _download_metric():
    import os
    import shutil

    from huggingface_hub import hf_hub_download

    scrolls_metric_path = hf_hub_download(repo_id="tau/scrolls", repo_type="dataset", filename="metrics/scrolls.py")
    # Copy the script to a sibling path with "." replaced by "_" in the filename.
    updated_scrolls_metric_path = os.path.join(
        os.path.dirname(scrolls_metric_path),
        os.path.basename(scrolls_metric_path).replace(".", "_") + ".py",
    )
    shutil.copy(scrolls_metric_path, updated_scrolls_metric_path)
    return updated_scrolls_metric_path
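

# --- Illustrative usage (an assumption, not part of the original file) ---
# The SCROLLS metric script cached by _download_metric() can be loaded through
# `evaluate` with a per-task config name such as "gov_report".
if __name__ == "__main__":
    scrolls_metric = evaluate.load(_download_metric(), config_name="gov_report")
    print(scrolls_metric)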
include: ../scroll_multiplechoice_task_yaml
task: scrolls_narrativeqa
dataset_name: narrative_qa
output_type: greedy_until
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{outputs| join(', ')}}"
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: f1
    aggregation: !function ../metrics.agg_f1
    higher_is_better: true
generation_kwargs:
  until:
    - "\n"
from functools import partial

from transformers import AutoTokenizer


def _num_cpu_cores():
    # https://stackoverflow.com/questions/1006289/how-to-find-out-the-number-of-cpus-using-python/55423170#55423170
    try:
        import psutil

        return psutil.cpu_count(logical=False)
    except ImportError:
        import os

        return len(os.sched_getaffinity(0))


def process_docs(
    dataset,
    custom_process=None,
    doc_to_text=None,
    PRUNE_TOKENIZERS=[],
    PRUNE_MAX_TOKENS=4096,
    PRUNE_NUM_PROC=_num_cpu_cores(),
):
    def _drop_duplicates_in_input(untokenized_dataset):
        # from scrolls/evaluator/dataset_evaluator.py
        indices_to_keep = []
        id_to_idx = {}
        outputs = []
        for i, (id_, output) in enumerate(zip(untokenized_dataset["id"], untokenized_dataset["output"])):
            if id_ in id_to_idx:
                outputs[id_to_idx[id_]].append(output)
                continue
            indices_to_keep.append(i)
            id_to_idx[id_] = len(outputs)
            outputs.append([output])
        untokenized_dataset = untokenized_dataset.select(indices_to_keep).flatten_indices()
        untokenized_dataset = untokenized_dataset.remove_columns("output")
        untokenized_dataset = untokenized_dataset.add_column("outputs", outputs)
        return untokenized_dataset

    dataset = _drop_duplicates_in_input(dataset)
    if custom_process is not None:
        dataset = dataset.map(custom_process)
    if len(PRUNE_TOKENIZERS) > 0:
        tokenizers = [AutoTokenizer.from_pretrained(tokenizer) for tokenizer in PRUNE_TOKENIZERS]
        cache = {}

        def _get_prune_text(doc):
            # Measure the text the model will see. A doc_to_text callable may be
            # passed in; otherwise fall back to the raw "input" field (assumption:
            # the class-based tasks pruned on the rendered prompt).
            return doc_to_text(doc) if doc_to_text is not None else doc["input"]

        def _filter(sample):
            text = _get_prune_text(sample)
            cached = cache.get(text, None)
            if cached is None:
                for tokenizer in tokenizers:
                    if len(tokenizer(text).input_ids) > PRUNE_MAX_TOKENS:
                        cache[text] = False
                        return False
                cache[text] = True
                return True
            else:
                return cached

        dataset = dataset.filter(_filter, num_proc=PRUNE_NUM_PROC)
    return dataset


def _doc_prepended_question(doc):
    # "When a query is given in addition to the raw text (as
    # in QMSum, Qasper, NarrativeQA, QuALITY, and ContractNLI),
    # we prepend it to the text, using two newlines as a natural separator"
    input = doc["input"]
    split = input.find("\n\n")
    return {
        "id": doc["id"],
        "pid": doc["pid"],
        "input": input,
        "outputs": doc["outputs"],
        "question": input[0:split],
        "text": input[split + 2:],
    }


process_docs_prepended_question = partial(process_docs, custom_process=_doc_prepended_question)
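

# --- Illustrative example (an assumption, not part of the original file) ---
if __name__ == "__main__":
    # Toy SCROLLS-style record showing how _doc_prepended_question splits the
    # query from the long text at the first blank line; the values are made up.
    example = _doc_prepended_question({
        "id": "toy-0",
        "pid": "toy-0_0",
        "input": "What is reported?\n\nThe committee met on Tuesday.",
        "outputs": ["A committee meeting"],
    })
    assert example["question"] == "What is reported?"
    assert example["text"] == "The committee met on Tuesday."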
from functools import partial

from preprocessors import _doc_prepended_question, process_docs

# Subset variant pruned for a GPTNeoX-tokenized 4K context; the PRUNE_* settings
# below mirror the example in the README (an assumption based on the name).
process_docs_prepended_questionGPTNeoX4K = partial(
    process_docs,
    custom_process=_doc_prepended_question,
    PRUNE_TOKENIZERS=["EleutherAI/pythia-410m-deduped"],
    PRUNE_MAX_TOKENS=4096,
)
group: scrolls
task: scrolls_qasper_boolean
dataset_path: tau/scrolls
dataset_name: qasper
output_type: multiple_choice
training_split: train
validation_split: validation
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{outputs[0]}}"
doc_to_choice: ["yes", "no"]
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: f1
group: scrolls
task: scrolls_qasper_freeform
dataset_path: tau/scrolls
dataset_name: qasper
output_type: greedy_until
training_split: train
validation_split: validation
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{outputs[0]}}"
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: f1
import sys
from functools import reduce

import transformers.data.metrics.squad_metrics as squad_metrics

# Mirror the sibling utils module's path hack so `preprocessors` resolves from the parent folder.
sys.path.append("..")

from preprocessors import process_docs_prepended_question


def process_docs(dataset):
    dataset = process_docs_prepended_question(dataset)

    def _process_doc(doc):
        # A question is yes/no iff every reference answer normalizes to "yes" or "no".
        doc["is_yes_no"] = reduce(
            lambda prev, cur: prev and squad_metrics.normalize_answer(cur) in ["yes", "no"],
            doc["outputs"],
            True,
        )
        return doc

    return dataset.map(_process_doc)


def process_results(doc, results):
    if doc["is_yes_no"]:
        prediction = " yes" if results[0] > results[1] else " no"
    elif len(results[0].strip()) == 0:
        prediction = "Unanswerable"
    else:
        prediction = results[0]
    return {
        "f1": (prediction, doc["outputs"])
    }
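

# --- Illustrative examples (assumptions, not part of the original file) ---
if __name__ == "__main__":
    # Yes/no questions: toy values standing in for the two loglikelihoods
    # (" yes" vs. " no") that the harness passes in.
    print(process_results({"is_yes_no": True, "outputs": ["yes"]}, [-1.2, -3.4]))
    # Free-form questions: results[0] is the generated string; an empty
    # generation maps to "Unanswerable".
    print(process_results({"is_yes_no": False, "outputs": ["blue"]}, ["blue"]))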
include: ../scroll_summary_task_yaml
task: scrolls_qmsum
dataset_path: tau/scrolls
dataset_name: qmsum
process_docs: !function ../preprocessors.process_docs_prepended_question
include: ../scroll_multiplechoice_task_yaml
task: scrolls_quality
dataset_name: quality
process_docs: !function utils.process_docs
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: gold
doc_to_choice: "{{choices}}"
import re
import sys
from functools import partial

sys.path.append('..')

from preprocessors import process_docs_prepended_question


def process_docs(dataset):
    dataset = process_docs_prepended_question(dataset)

    _multiple_choice_pattern = re.compile(r" *\([A-D]\) *")

    def _normalize_answer(text):
        return " ".join(text.split()).strip()

    def _process_doc(doc):
        split = doc["text"].find("\n\n", doc["text"].find("(D)"))
        choices_text = doc["text"][:split]
        doc["text"] = doc["text"][split:].strip()
        doc["choices"] = [
            _normalize_answer(choice)
            for choice in re.split(_multiple_choice_pattern, choices_text)[1:]
        ]
        doc["gold"] = doc["choices"].index(_normalize_answer(doc["outputs"][0]))
        return doc

    return dataset.map(_process_doc)
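

# --- Illustrative example (an assumption, not part of the original file) ---
if __name__ == "__main__":
    # QuALITY inputs begin with the answer options "(A) ... (D) ...", which are
    # split out of the text before the gold index is looked up; toy string below.
    toy_choices = " (A) red (B) green (C) blue (D) yellow"
    print(re.split(re.compile(r" *\([A-D]\) *"), toy_choices)[1:])
    # -> ['red', 'green', 'blue', 'yellow']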
group: scrolls
dataset_path: tau/scrolls
output_type: multiple_choice
training_split: train
validation_split: validation
process_docs: !function preprocessors.process_docs
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: acc
  - metric: acc_norm
group: scrolls
dataset_path: tau/scrolls
output_type: greedy_until
training_split: train
validation_split: validation
process_docs: !function preprocessors.process_docs
doc_to_text: "{{input}}\n\nQuestion: What is a summary of the preceding text?\nAnswer:"
doc_to_target: "{{outputs|join(', ')}}"
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: !function metrics.rouge1
    aggregation: mean
    higher_is_better: true
  - metric: !function metrics.rouge2
    aggregation: mean
    higher_is_better: true
  - metric: !function metrics.rougeL
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "\n"
include: ../scroll_summary_task_yaml
task: scrolls_summscreenfd
dataset_path: tau/scrolls
dataset_name: summ_screen_fd