Commit d1c189ea authored by lintangsutawika

remove scrolls from this PR

parent 1f351067
# SCROLLS
### Paper
Title: `SCROLLS: Standardized CompaRison Over Long Language Sequences`
Abstract: https://arxiv.org/abs/2201.03533
SCROLLS is a suite of datasets that require synthesizing information over long texts.
The benchmark includes seven natural language tasks across multiple domains,
including summarization, question answering, and natural language inference.
Homepage: https://www.scrolls-benchmark.com/
Since SCROLLS tasks are generally longer than the maximum sequence length of many models,
it is possible to create "subset" tasks that contain only those samples whose tokenized length
is less than some pre-defined limit. For example, to create a subset of "Qasper" that would
be suitable for a model using the GPTNeoX tokenizer and a 4K maximum sequence length:
```
class QasperGPTNeoX4K(Qasper):
    PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped"]
    PRUNE_MAX_TOKENS = 4096
    PRUNE_NUM_PROC = _num_cpu_cores()  # optional; speeds up pruning of large datasets like NarrativeQA
```
`PRUNE_TOKENIZERS` can contain more than one tokenizer; in that case only samples whose tokenized length
is less than `PRUNE_MAX_TOKENS` for ALL of the tokenizers are kept. This can be useful for comparing models
that use different tokenizers but the same maximum sequence length.
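For example (a sketch; the class name below is illustrative, following the example above), a subset whose samples fit a 4K context under both the Pythia (GPT-NeoX) and GPT-2 tokenizers could be defined as:
```
class QasperMultiTokenizer4K(Qasper):
    PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped", "gpt2"]
    PRUNE_MAX_TOKENS = 4096
```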
Once the subset task class has been defined in this file, it can be used by adding the class
to `lm_eval/tasks/__init__.py`.
NOTE: GovReport may need `max_gen_toks` set larger for causal models.
### Citation
```
@inproceedings{shaham-etal-2022-scrolls,
    title = "{SCROLLS}: Standardized {C}ompa{R}ison Over Long Language Sequences",
    author = "Shaham, Uri and
      Segal, Elad and
      Ivgi, Maor and
      Efrat, Avia and
      Yoran, Ori and
      Haviv, Adi and
      Gupta, Ankit and
      Xiong, Wenhan and
      Geva, Mor and
      Berant, Jonathan and
      Levy, Omer",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.823",
    pages = "12007--12021"
}
```
### Groups and Tasks
#### Groups
* `scrolls`: executes all SCROLLS tasks in this folder, including both `scrolls_qasper_boolean` and `scrolls_qasper_freeform`
#### Tasks
* `scrolls_qasper_boolean`: Multiple choice task that evaluates the Qasper samples with `answer_type="bool"`
* `scrolls_qasper_freeform`: Greedy generation task that evaluates the Qasper samples with `answer_type="free form answer"`
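The tasks above can also be run programmatically. The snippet below is a sketch that assumes the harness exposes the `lm_eval.simple_evaluate` entry point and uses a small HF model; adjust the model arguments to your setup.
```
import lm_eval

# Sketch: evaluate both Qasper variants with a small HuggingFace model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-410m-deduped",
    tasks=["scrolls_qasper_boolean", "scrolls_qasper_freeform"],
)
print(results["results"])
```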
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: ../scroll_multiplechoice_task_yaml
task: scrolls_contractnli
dataset_name: contract_nli
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nHypothesis: {{question}}\nConclusion:"
doc_to_target: "{{outputs[0]}}"
doc_to_choice: ["Not mentioned", "Entailment", "Contradiction"]
should_decontaminate: true
doc_to_decontamination_query: input
include: ../scroll_summary_task_yaml
task: scrolls_govreport
dataset_path: tau/scrolls
dataset_name: gov_report
import evaluate

rouge_fn = evaluate.load('rouge')


def rouge1(predictions, references):
    results = rouge_fn.compute(predictions=predictions, references=references)
    return results['rouge1']


def rouge2(predictions, references):
    results = rouge_fn.compute(predictions=predictions, references=references)
    return results['rouge2']


def rougeL(predictions, references):
    results = rouge_fn.compute(predictions=predictions, references=references)
    return results['rougeL']


squad_metric = evaluate.load("squad_v2")


def agg_f1(samples):
    predictions, references = zip(*samples)  # unzip, if you will
    computed = squad_metric.compute(predictions=predictions, references=references)
    return computed["f1"]
def _download_metric():
    import os
    import shutil

    from huggingface_hub import hf_hub_download

    scrolls_metric_path = hf_hub_download(repo_id="tau/scrolls", repo_type="dataset", filename="metrics/scrolls.py")
    # Copy the script to a sibling path with "." replaced by "_" in the filename.
    updated_scrolls_metric_path = os.path.join(
        os.path.dirname(scrolls_metric_path),
        os.path.basename(scrolls_metric_path).replace(".", "_") + ".py",
    )
    shutil.copy(scrolls_metric_path, updated_scrolls_metric_path)
    return updated_scrolls_metric_path
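

# --- Illustrative usage (an assumption, not part of the original file) ---
# The SCROLLS metric script cached by _download_metric() can be loaded through
# `evaluate` with a per-task config name such as "gov_report".
if __name__ == "__main__":
    scrolls_metric = evaluate.load(_download_metric(), config_name="gov_report")
    print(scrolls_metric)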
include: ../scroll_multiplechoice_task_yaml
task: scrolls_narrativeqa
dataset_name: narrative_qa
output_type: greedy_until
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{outputs| join(', ')}}"
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: f1
    aggregation: !function ../metrics.agg_f1
    higher_is_better: true
generation_kwargs:
  until:
    - "\n"
from functools import partial

from transformers import AutoTokenizer


def _num_cpu_cores():
    # https://stackoverflow.com/questions/1006289/how-to-find-out-the-number-of-cpus-using-python/55423170#55423170
    try:
        import psutil

        return psutil.cpu_count(logical=False)
    except ImportError:
        import os

        return len(os.sched_getaffinity(0))


def process_docs(
    dataset,
    custom_process=None,
    doc_to_text=None,
    PRUNE_TOKENIZERS=[],
    PRUNE_MAX_TOKENS=4096,
    PRUNE_NUM_PROC=_num_cpu_cores(),
):
    def _drop_duplicates_in_input(untokenized_dataset):
        # from scrolls/evaluator/dataset_evaluator.py
        indices_to_keep = []
        id_to_idx = {}
        outputs = []
        for i, (id_, output) in enumerate(zip(untokenized_dataset["id"], untokenized_dataset["output"])):
            if id_ in id_to_idx:
                outputs[id_to_idx[id_]].append(output)
                continue
            indices_to_keep.append(i)
            id_to_idx[id_] = len(outputs)
            outputs.append([output])
        untokenized_dataset = untokenized_dataset.select(indices_to_keep).flatten_indices()
        untokenized_dataset = untokenized_dataset.remove_columns("output")
        untokenized_dataset = untokenized_dataset.add_column("outputs", outputs)
        return untokenized_dataset

    dataset = _drop_duplicates_in_input(dataset)
    if custom_process is not None:
        dataset = dataset.map(custom_process)
    if len(PRUNE_TOKENIZERS) > 0:
        tokenizers = [AutoTokenizer.from_pretrained(tokenizer) for tokenizer in PRUNE_TOKENIZERS]
        cache = {}

        def _get_prune_text(doc):
            # Measure the text the model will see. A doc_to_text callable may be
            # passed in; otherwise fall back to the raw "input" field (assumption:
            # the class-based tasks pruned on the rendered prompt).
            return doc_to_text(doc) if doc_to_text is not None else doc["input"]

        def _filter(sample):
            text = _get_prune_text(sample)
            cached = cache.get(text, None)
            if cached is None:
                for tokenizer in tokenizers:
                    if len(tokenizer(text).input_ids) > PRUNE_MAX_TOKENS:
                        cache[text] = False
                        return False
                cache[text] = True
                return True
            else:
                return cached

        dataset = dataset.filter(_filter, num_proc=PRUNE_NUM_PROC)
    return dataset


def _doc_prepended_question(doc):
    # "When a query is given in addition to the raw text (as
    # in QMSum, Qasper, NarrativeQA, QuALITY, and ContractNLI),
    # we prepend it to the text, using two newlines as a natural separator"
    input = doc["input"]
    split = input.find("\n\n")
    return {
        "id": doc["id"],
        "pid": doc["pid"],
        "input": input,
        "outputs": doc["outputs"],
        "question": input[0:split],
        "text": input[split + 2:],
    }


process_docs_prepended_question = partial(process_docs, custom_process=_doc_prepended_question)
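

# --- Illustrative example (an assumption, not part of the original file) ---
if __name__ == "__main__":
    # Toy SCROLLS-style record showing how _doc_prepended_question splits the
    # query from the long text at the first blank line; the values are made up.
    example = _doc_prepended_question({
        "id": "toy-0",
        "pid": "toy-0_0",
        "input": "What is reported?\n\nThe committee met on Tuesday.",
        "outputs": ["A committee meeting"],
    })
    assert example["question"] == "What is reported?"
    assert example["text"] == "The committee met on Tuesday."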
from functools import partial

from preprocessors import _doc_prepended_question, process_docs

# Subset variant pruned for a GPTNeoX-tokenized 4K context; the PRUNE_* settings
# below mirror the example in the README (an assumption based on the name).
process_docs_prepended_questionGPTNeoX4K = partial(
    process_docs,
    custom_process=_doc_prepended_question,
    PRUNE_TOKENIZERS=["EleutherAI/pythia-410m-deduped"],
    PRUNE_MAX_TOKENS=4096,
)
group: scrolls
task: scrolls_qasper_boolean
dataset_path: tau/scrolls
dataset_name: qasper
output_type: multiple_choice
training_split: train
validation_split: validation
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{outputs[0]}}"
doc_to_choice: ["yes", "no"]
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: f1
group: scrolls
task: scrolls_qasper_freeform
dataset_path: tau/scrolls
dataset_name: qasper
output_type: greedy_until
training_split: train
validation_split: validation
process_docs: !function ../preprocessors.process_docs_prepended_question
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{outputs[0]}}"
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: f1
import sys
from functools import reduce

import transformers.data.metrics.squad_metrics as squad_metrics

# Mirror the sibling utils module's path hack so `preprocessors` resolves from the parent folder.
sys.path.append("..")

from preprocessors import process_docs_prepended_question


def process_docs(dataset):
    dataset = process_docs_prepended_question(dataset)

    def _process_doc(doc):
        # A question is yes/no iff every reference answer normalizes to "yes" or "no".
        doc["is_yes_no"] = reduce(
            lambda prev, cur: prev and squad_metrics.normalize_answer(cur) in ["yes", "no"],
            doc["outputs"],
            True,
        )
        return doc

    return dataset.map(_process_doc)


def process_results(doc, results):
    if doc["is_yes_no"]:
        prediction = " yes" if results[0] > results[1] else " no"
    elif len(results[0].strip()) == 0:
        prediction = "Unanswerable"
    else:
        prediction = results[0]
    return {
        "f1": (prediction, doc["outputs"])
    }
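

# --- Illustrative examples (assumptions, not part of the original file) ---
if __name__ == "__main__":
    # Yes/no questions: toy values standing in for the two loglikelihoods
    # (" yes" vs. " no") that the harness passes in.
    print(process_results({"is_yes_no": True, "outputs": ["yes"]}, [-1.2, -3.4]))
    # Free-form questions: results[0] is the generated string; an empty
    # generation maps to "Unanswerable".
    print(process_results({"is_yes_no": False, "outputs": ["blue"]}, ["blue"]))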
include: ../scroll_summary_task_yaml
task: scrolls_qmsum
dataset_path: tau/scrolls
dataset_name: qmsum
process_docs: !function ../preprocessors.process_docs_prepended_question
include: ../scroll_multiplechoice_task_yaml
task: scrolls_quality
dataset_name: quality
process_docs: !function utils.process_docs
doc_to_text: "{{text}}\n\nQuestion: {{question}}\nAnswer:"
doc_to_target: gold
doc_to_choice: "{{choices}}"
import re
import sys
from functools import partial

sys.path.append('..')

from preprocessors import process_docs_prepended_question


def process_docs(dataset):
    dataset = process_docs_prepended_question(dataset)

    _multiple_choice_pattern = re.compile(r" *\([A-D]\) *")

    def _normalize_answer(text):
        return " ".join(text.split()).strip()

    def _process_doc(doc):
        split = doc["text"].find("\n\n", doc["text"].find("(D)"))
        choices_text = doc["text"][:split]
        doc["text"] = doc["text"][split:].strip()
        doc["choices"] = [
            _normalize_answer(choice)
            for choice in re.split(_multiple_choice_pattern, choices_text)[1:]
        ]
        doc["gold"] = doc["choices"].index(_normalize_answer(doc["outputs"][0]))
        return doc

    return dataset.map(_process_doc)
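

# --- Illustrative example (an assumption, not part of the original file) ---
if __name__ == "__main__":
    # QuALITY inputs begin with the answer options "(A) ... (D) ...", which are
    # split out of the text before the gold index is looked up; toy string below.
    toy_choices = " (A) red (B) green (C) blue (D) yellow"
    print(re.split(re.compile(r" *\([A-D]\) *"), toy_choices)[1:])
    # -> ['red', 'green', 'blue', 'yellow']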
group: scrolls
dataset_path: tau/scrolls
output_type: multiple_choice
training_split: train
validation_split: validation
process_docs: !function preprocessors.process_docs
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: acc
  - metric: acc_norm
group: scrolls
dataset_path: tau/scrolls
output_type: greedy_until
training_split: train
validation_split: validation
process_docs: !function preprocessors.process_docs
doc_to_text: "{{input}}\n\nQuestion: What is a summary of the preceding text?\nAnswer:"
doc_to_target: "{{outputs|join(', ')}}"
should_decontaminate: true
doc_to_decontamination_query: input
metric_list:
  - metric: !function metrics.rouge1
    aggregation: mean
    higher_is_better: true
  - metric: !function metrics.rouge2
    aggregation: mean
    higher_is_better: true
  - metric: !function metrics.rougeL
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "\n"
include: ../scroll_summary_task_yaml
task: scrolls_summscreenfd
dataset_path: tau/scrolls
dataset_name: summ_screen_fd