Unverified Commit e00d682f authored by Jonathan Tow and committed by GitHub

Merge pull request #261 from EleutherAI/researcher2

Update CLI options and introduce decontamination
parents eb8163e9 ab6883b1
@@ -26,7 +26,7 @@ To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you ca
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks lambada,hellaswag
```
(This uses gpt2-117M by default, as per the HF defaults; use `--model_args` to specify other GPT-2 sizes.)
@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
--device 0 \
--tasks lambada,hellaswag
```
@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--device 0 \
--tasks lambada,hellaswag \
--num_fewshot 2
@@ -392,6 +392,21 @@ python write_out.py \
This will write out one text file for each task.
### Test Set Decontamination
For more details see the [decontamination guide](./docs/decontamination.md).
The directory provided with the `--decontamination_ngrams_path` argument should contain
the n-gram files and `info.json`. See the guide above for n-gram generation for the Pile; this could be adapted to other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure
There are two major components of the library:
...
# Decontamination
## Usage
Simply add the `--decontamination_ngrams_path` argument when running main.py. The provided directory should contain
the n-gram files and `info.json` produced in "Pile Ngram Generation" further down.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
## Background
Downstream evaluations test model generalization, and are less useful when test-set data also exists in the training set (leakage/contamination).
As a first step this is addressed by filtering the training set; however, benchmarks often don't exist or weren't considered prior to model training. In that case it is useful to measure the impact of test-set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on the dataset; we just use 13 for simplicity.
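As a toy illustration of this test (not the harness implementation, which normalizes text with the `Janitor` helper and scans pre-sorted n-gram files on disk), a test document is flagged as soon as a single 13-gram is shared with any training document:

```python
# Toy sketch of the contamination test described above; illustrative only.
def word_13grams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_doc, training_docs, n=13):
    test_ngrams = word_13grams(test_doc, n)
    # A single shared n-gram with any training document marks the test doc as contaminated.
    return any(test_ngrams & word_13grams(train_doc, n) for train_doc in training_docs)
```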
## Implementation
Contamination detection can be found in `lm_eval/decontamination/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.
`decontaminate.py` does the following (a condensed sketch follows the list below):
1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
2. Scan through sorted files containing training set n-grams.
3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
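A condensed sketch of these three steps, assuming the training-set n-grams are already available as plain "ngram document_id" lines (the real `decontaminate.py` additionally caches lookups per task/set, merges lookups across tasks, and streams the zstd-compressed buckets):

```python
import collections

def find_contaminated_docs(eval_docs, training_ngram_lines, n=13):
    # eval_docs: {doc_id: normalized document text}
    # training_ngram_lines: iterable of "ngram document_id" lines from the
    # sorted training-set n-gram files.

    # 1. Build a lookup of every evaluation n-gram -> ids of the documents containing it.
    lookup = collections.defaultdict(set)
    for doc_id, text in eval_docs.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            lookup[" ".join(words[i:i + n])].add(doc_id)

    # 2./3. Scan the training-set n-grams; any hit marks the matching evaluation docs.
    contaminated = set()
    for line in training_ngram_lines:
        ngram, _train_doc_id = line.rsplit(" ", 1)
        if ngram in lookup:
            contaminated |= lookup.pop(ngram)  # no need to match this ngram again
    return contaminated
```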
"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.
This is disabled by default for new tasks, to support decontamination on a task override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
## Pile Ngram Generation
The relevant scripts can be found in `scripts/clean_training_data`, and they also import from
`lm_eval/decontamination/`.
1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
2. pip install -r requirements.txt
3. Download The Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
4. Place pile files in "pile" directory under "lm-evaluation-harness" (or create a symlink)
5. Run generate_13_grams.
```bash
export PYTHONHASHSEED=0
python -m scripts.clean_training_data.generate_13_grams \
-dir path/to/working/directory \
-n 13 \
-buckets 500
```
This took approximately 4 days for us. We had the time to wait, but the work could be scaled out by running partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of bucket hashing.
6. Sort the generated 13-grams.
```bash
python -m scripts.clean_training_data.sort_13_gram_buckets \
-dir path/to/working/directory/output
```
This took approximately 5 days for us. You could speed this up by spreading the files across different machines, running the sort script on each, and then gathering the sorted files back together.
7. Compress the sorted 13-gram files and package them together with `info.json`.
This step only takes a few hours.
```bash
python -m scripts.clean_training_data.compress_and_package \
-dir path/to/working/directory \
-output path/to/final/directory \
-procs 8
```
Congratulations, the final directory can now be passed to lm-evaluation-harness with the `--decontamination_ngrams_path` argument.
@@ -151,6 +151,13 @@ def doc_to_target(self, doc):
Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
### Decontamination
For background on decontamination, please see the [decontamination guide](./decontamination.md).
If you wish to support decontamination studies for your task, simply override the `should_decontaminate` method and return `True`.
You also need to override `doc_to_decontamination_query` and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request; we leave that up to the implementor. For a multiple-choice evaluation you could, for example, just return the question, as in the sketch below.
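For example, here is a minimal sketch for a hypothetical multiple-choice task whose documents carry a "question" field (the class and field name are illustrative, not from the library; the usual dataset plumbing is omitted):

```python
from lm_eval.base import MultipleChoiceTask

class MyMultipleChoiceTask(MultipleChoiceTask):
    # ... dataset plumbing (has_training_docs, training_docs, doc_to_text, etc.) omitted ...

    def should_decontaminate(self):
        return True

    def doc_to_decontamination_query(self, doc):
        # Only the question text is compared against the training set;
        # returning the full prompt including answer choices would also work.
        return doc["question"]
```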
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
...
@@ -415,6 +415,10 @@ class Task(abc.ABC):
download_mode=download_mode
)
def should_decontaminate(self):
"""Whether this task supports decontamination against model training set."""
return False
@abstractmethod
def has_training_docs(self):
"""Whether the task has a training set"""
@@ -468,6 +472,10 @@ class Task(abc.ABC):
return rnd.sample(self._training_docs, k)
def doc_to_decontamination_query(self, doc):
raise NotImplementedError("Override doc_to_decontamination_query with a document-specific decontamination query.")
@abstractmethod
def doc_to_text(self, doc):
pass
@@ -625,6 +633,10 @@ class MultipleChoiceTask(Task):
class PerplexityTask(Task, abc.ABC):
def should_decontaminate(self):
"""Whether this task supports decontamination against model training set."""
return True
def has_training_docs(self):
return False
@@ -653,6 +665,9 @@ class PerplexityTask(Task, abc.ABC):
"bits_per_byte": False,
}
def doc_to_decontamination_query(self, doc):
return doc
def doc_to_text(self, doc):
return ""
...
@@ -4,6 +4,9 @@ import json
import jsonlines
import io
import datetime
import mmap
import tqdm
from pathlib import Path
def json_serial(obj):
"""JSON serializer for objects not serializable by default json code"""
@@ -59,16 +62,19 @@ class Reader:
else:
yield text
# Simple text reader and writer with same interface as above
class TextArchive:
def __init__(self, file_path, mode="rb+"):
self.file_path = file_path
dir_name = os.path.dirname(file_path)
if dir_name:
os.makedirs(dir_name, exist_ok=True)
if not os.path.exists(file_path):
Path(file_path).touch()
self.fh = open(self.file_path, mode)
def add_data(self, data):
self.fh.write(data.encode('UTF-8') + b'\n')
def commit(self):
@@ -79,12 +85,63 @@ class TextReader:
def __init__(self, file_path):
self.file_path = file_path
# Optimized mmap read with infrequent tqdm updates to maintain speed
# Tested up to 250MB/s.
def read_tqdm(self, update_frequency=10000):
current_file_position = 0
line_counter = 0
with open(self.file_path, 'r') as fh, \
tqdm.tqdm(total=os.path.getsize(self.file_path), dynamic_ncols=True,
unit="byte", unit_scale=1) as progress:
with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
for line in iter(mmap_obj.readline, b""):
line = line.decode("utf-8")
line_counter += 1
if line_counter == update_frequency:
new_file_pos = mmap_obj.tell()
bytes_read = new_file_pos - current_file_position
current_file_position = new_file_pos
progress.update(bytes_read)
line_counter = 0
yield line[:-1]
def read_and_tell(self):
current_file_position = 0
with open(self.file_path, 'r', encoding="utf8") as fh:
with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
for line in iter(mmap_obj.readline, b""):
line = line.decode("utf-8")
new_file_pos = mmap_obj.tell()
raw_bytes_read = new_file_pos - current_file_position
current_file_position = new_file_pos
yield line[:-1], raw_bytes_read
def read(self):
with open(self.file_path, 'r', encoding="utf8") as fh:
with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
for line in iter(mmap_obj.readline, b""):
line = line.decode("utf-8")
yield line[:-1]
def read_slow(self):
with open(self.file_path, 'r', encoding="utf8") as fh:
while True:
line = fh.readline()
if line == -1 or line == "":
break
else:
yield line[:-1]
# Optimized for speed. Decompresses the archive in shell before
# using the mmap'd TextReader.
class ZStdTextReader:
def __init__(self, file):
self.file = file
def read_tqdm(self):
decompressed_file = self.file[:-4]
print("Decompressing file, please wait...")
os.system(f"zstd -d {self.file}") # linux decompress is faster
reader = TextReader(decompressed_file)
yield from reader.read_tqdm()
os.remove(decompressed_file)
\ No newline at end of file
import time
import random
import pickle
import json
import glob
import os
import collections
from .janitor import Janitor, word_ngrams
from .archiver import ZStdTextReader
# Was used for testing the evaluator decoupled from the full logic below
def get_train_overlap_stub(docs, ngrams_path, ngrams_n_size):
simulated_overlap = 0.1
contaminated = int(len(docs) * simulated_overlap)
return random.sample(range(len(docs)), contaminated)
# Returns a dictionary containing all overlapping documents in each
# task. In the standard use case, an overlap occurs when any of the 13-grams
# found in the task document exist in the training set documents.
#
# To generate 13-grams for the pile see scripts/clean_training_data. The final output of these
# scripts are an info.json file containing the n_gram_size (13) and a bunch of "ngrams_{x}.bkt.txt.sorted.zst"
# files. These should exist in the "ngrams_path" provided to this function.
# Algorithm:
# 1. Build lookups for each dataset {ngram: list(document_ids)}
# 2. Merge into an overall lookup {ngram: [(task_name, task_set, doc_ids),]}
# 3. Full scan the 13-grams from the training set against the merged lookup,
# saving matches in the "duplicates" dictionary {(task_name, task_set): set(doc_ids)}
# 4. Strip the task_set from the dictionary keys and return
#
# We cache the task+set lookups as well as the overlaps.
def get_train_overlap(docs_by_task_set, ngrams_path, limit):
# return get_train_overlap_stub(docs, ngrams_path, ngrams_n_size)
info_dict_path = os.path.join(ngrams_path, "info.json")
info_dict = json.load(open(info_dict_path, "r"))
ngrams_n_size = info_dict["ngram_size"]
janitor = Janitor()
# Build lookup for each dataset first in case we use different task combinations later
print("Building Lookups...")
start = time.perf_counter()
def get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit):
return f"data/{task_name}/{task_set}_{ngrams_n_size}grams_limit{limit}.overlaps"
lookups = {}
duplicates = {} # (task_name, task_set): set(doc_ids)}
sets_to_decontaminate = len(docs_by_task_set.keys())
for (task_name, task_set), docs in docs_by_task_set.items():
if not os.path.exists(f"data/{task_name}"):
os.mkdir(f"data/{task_name}")
# Check if we've decontaminated this combination before
overlaps_dump_path = get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit)
if os.path.exists(overlaps_dump_path):
duplicates[(task_name, task_set)] = pickle.load(open(overlaps_dump_path, "rb"))
sets_to_decontaminate -= 1
continue
else:
duplicates[(task_name, task_set)] = set()
# Build/load the task lookup {ngram: set(documents)}.
task_set_lookup_path = f"data/{task_name}/{task_set}_{ngrams_n_size}grams_limit{limit}.lookup"
if os.path.exists(task_set_lookup_path):
print(f"{task_set_lookup_path} available, loading...")
lookups[(task_name, task_set)] = pickle.load(open(task_set_lookup_path, "rb"))
else:
print(f"{task_set_lookup_path} not available, building...")
lookup = collections.defaultdict(set)
for doc_id, document in enumerate(docs):
ngrams = word_ngrams(janitor.normalize_string(document), ngrams_n_size)
for ngram in ngrams:
lookup[ngram].add(doc_id)
pickle.dump(lookup, open(task_set_lookup_path,"wb"))
lookups[(task_name, task_set)] = lookup
elapsed = time.perf_counter() - start
print(f"Building lookups took {elapsed:0.5f} seconds.")
matched_ngrams = []
if sets_to_decontaminate > 0:
print("Merging lookups...")
start = time.perf_counter()
merged_lookup = collections.defaultdict(list)
for (task_name, task_set), lookup in lookups.items():
for ngram, doc_ids in lookup.items():
merged_lookup[ngram].append((task_name, task_set, doc_ids))
elapsed = time.perf_counter() - start
print(f"Merging lookups took {elapsed:0.5f} seconds.")
print(f"{ngrams_n_size} grams files found in {ngrams_path}:")
files = glob.glob(os.path.join(ngrams_path, f"*.sorted.zst"))
print(files)
for file in files:
start = time.perf_counter()
print(f"Scanning {file}")
reader = ZStdTextReader(file)
total_ngrams = 0
unique_ngrams = 0
matching_unique = 0
non_matching_unique = 0
current_ngram = ""
for line in reader.read_tqdm(): # Scan training set ngrams file
total_ngrams += 1
[ngram, document_id] = line.rsplit(" ", 1)
if ngram != current_ngram: # Only need to match the ngram once in training set
unique_ngrams += 1
current_ngram = ngram
if ngram in merged_lookup:
matched_ngrams.append(ngram) # For logging
matching_unique += 1
for task_name, task_set, doc_ids in merged_lookup[ngram]:
task_doc_set = duplicates[(task_name, task_set)]
for doc_id in doc_ids: # Record contamination across all relevant task/set combos
task_doc_set.add(doc_id)
del merged_lookup[ngram] # No point matching again
else:
non_matching_unique += 1
print(f"Total Ngrams: {total_ngrams}")
print(f"Unique Ngrams: {unique_ngrams}")
print(f"Unique Matching: {matching_unique}")
print(f"Unique Non Matching: {non_matching_unique}")
print("Matched ngrams:")
for ngram in matched_ngrams:
print(ngram)
elapsed = time.perf_counter() - start
print(f"Read took {elapsed:0.5f} seconds.")
print(f"Speed: {(os.path.getsize(file)/1000000.0)/elapsed}MB/second")
print(duplicates)
# Dump overlaps separately
for (task_name, task_set), doc_ids in duplicates.items():
overlaps_dump_path = get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit)
pickle.dump(doc_ids, open(overlaps_dump_path,"wb"))
# Strip task set and return
return {task_name: doc_ids for (task_name, task_set), doc_ids in duplicates.items()}
@@ -6,15 +6,18 @@ import lm_eval.metrics
import lm_eval.models
import lm_eval.tasks
import lm_eval.base
import lm_eval.decontamination
import numpy as np
from lm_eval.utils import positional_deprecated, run_task_tests
from lm_eval.decontamination.decontaminate import get_train_overlap
@positional_deprecated
def simple_evaluate(model, model_args=None, tasks=[],
num_fewshot=0, batch_size=None, device=None,
no_cache=False, limit=None, bootstrap_iters=100000,
description_dict=None, check_integrity=False,
decontamination_ngrams_path=None):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM] :param model: Union[str, LM]
@@ -72,7 +75,8 @@ def simple_evaluate(model, model_args=None, tasks=[],
task_dict=task_dict,
num_fewshot=num_fewshot,
limit=limit,
description_dict=description_dict,
decontamination_ngrams_path=decontamination_ngrams_path,
)
# add info about the model and few shot config
@@ -90,9 +94,11 @@ def simple_evaluate(model, model_args=None, tasks=[],
return results
decontaminate_suffix = "_decontaminate"
@positional_deprecated
def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None,
decontamination_ngrams_path=None):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
:param lm: obj :param lm: obj
...@@ -120,6 +126,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, ...@@ -120,6 +126,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# nudge people to not specify it at all # nudge people to not specify it at all
print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")
decontaminate = decontamination_ngrams_path is not None
task_dict_items = [
(name, task)
for name, task in task_dict.items()
@@ -132,6 +140,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
requests = collections.defaultdict(list)
requests_origin = collections.defaultdict(list)
overlaps = collections.defaultdict(list) # {task_name: contaminated_docs}
# If we ever run into issues where the eval tasks don't fit in memory and we can't afford a machine with bigger
# memory, we can always modify this plumbing to support that, but I didn't want to include it just yet because
# over-engineering is bad (or we could make it write the requests to disk and then read them back out again
@@ -140,6 +150,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# TODO: we need unit tests & sanity checks or something to ensure that the return of `validation_docs` is stable
docs = {}
docs_for_decontamination = collections.defaultdict(list)
# get lists of each type of request
for task_name, task in task_dict_items:
versions[task_name] = task.VERSION
@@ -147,7 +159,9 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# TODO: the test-fallback-to-val system isn't final, we should revisit it at some point
if task.has_test_docs():
task_doc_func = task.test_docs
task_set = "test" # Required for caching in the decontamination
elif task.has_validation_docs():
task_set = "val" # Required for caching in the decontamination
task_doc_func = task.validation_docs
else:
raise RuntimeError("Task has neither test_docs nor validation_docs")
@@ -161,6 +175,10 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
description = description_dict[task_name] if description_dict and task_name in description_dict else ""
for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
if decontaminate and task.should_decontaminate():
docs_for_decontamination[(task_name, task_set)].append(task.doc_to_decontamination_query(doc))
docs[(task_name, doc_id)] = doc
ctx = task.fewshot_context(
doc=doc,
@@ -177,6 +195,11 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# doc_id: unique id that we can get back to a doc using `docs`
requests_origin[req.request_type].append((i, task_name, doc, doc_id))
# Compare all tasks/sets at once to ensure a single training set scan
if decontaminate:
print("Finding train/test overlap, please wait...")
overlaps = get_train_overlap(docs_for_decontamination, decontamination_ngrams_path, limit)
# all responses for each (task, doc)
process_res_queue = collections.defaultdict(list)
@@ -207,18 +230,28 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
metrics = task.process_results(doc, requests)
for metric, value in metrics.items():
vals[(task_name, metric)].append(value)
# Re-use the evaluation for the decontaminated set by just ignoring the overlaps
if decontaminate and task_name in overlaps:
if doc_id not in overlaps[task_name]:
vals[(task_name, metric + decontaminate_suffix)].append(value)
# aggregate results
for (task_name, metric), items in vals.items():
task = task_dict[task_name]
real_metric = metric # key when looking up the metric with task.aggregation
if metric.endswith(decontaminate_suffix):
real_metric = metric.replace(decontaminate_suffix, "") # decontaminated still uses the same metric
results[task_name][metric] = task.aggregation()[real_metric](items)
# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
# so we run them less iterations. still looking for a cleaner way to do this
stderr = lm_eval.metrics.stderr_for_metric(
metric=task.aggregation()[real_metric],
bootstrap_iters=min(bootstrap_iters, 1000) if metric in ["bleu", "chrf", "ter"] else bootstrap_iters,
)
if stderr is not None:
results[task_name][metric + "_stderr"] = stderr(items)
...
@@ -12,9 +12,14 @@ class HFLM(BaseLM):
assert isinstance(pretrained, str)
assert isinstance(batch_size, int)
if device:
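# An integer device (e.g. "--device 0" on the command line) is passed to torch.device as a CUDA device ordinal.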
if device not in ["cuda", "cpu"]:
device = int(device)
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specificed")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# TODO: update this to be less of a hack once subfolder is fixed in HF
...
@@ -66,6 +66,12 @@ class ANLIBase(Task):
# want to do it exactly as OA did?
return doc['premise'] + '\nQuestion: ' + doc['hypothesis'] + ' True, False, or Neither?\nAnswer:'
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["premise"]
def doc_to_target(self, doc):
# True = entailment
# False = contradiction
...
@@ -67,6 +67,12 @@ class ARCEasy(MultipleChoiceTask):
def doc_to_text(self, doc):
return doc["query"]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["query"]
class ARCChallenge(ARCEasy):
DATASET_PATH = "ai2_arc"
...
@@ -53,6 +53,12 @@ class Arithmetic(Task):
def doc_to_text(self, doc):
return doc["context"]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["context"]
def doc_to_target(self, doc):
return doc["completion"]
...
@@ -67,6 +67,12 @@ class Asdiv(Task):
# TODO: add solution-type
return doc['body'] + '\n' + 'Question:' + doc['question'] + '\n' + 'Answer:'
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc['body'] + " " + doc['question']
def doc_to_target(self, doc):
# TODO: add formula
...
@@ -68,6 +68,12 @@ class BlimpTask(Task):
# this method is invoked by tests only
return ""
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["sentence_good"] + " " + doc["sentence_bad"]
def doc_to_target(self, doc):
# this method is invoked by tests only
return ""
...
@@ -75,6 +75,13 @@ class CBTBase(Task):
text = "Passage: " + passage + "\nQuestion: " + doc["question"]
return self.detokenize(text)
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
passage = " ".join(doc["sentences"])
return passage
def doc_to_target(self, doc):
return ""
...
@@ -61,6 +61,12 @@ class CoQA(Task):
doc_text += question + answer
return doc_text
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["story"] + " " + "\n".join(doc["questions"]["input_text"])
@classmethod
def get_answers(cls, doc, turn_id):
# Returns unique answers and valid alternatives (Some questions in CoQA have multiple valid answers).
...
@@ -107,6 +107,12 @@ class DROP(Task):
def doc_to_text(self, doc):
return f"Passage: {doc['passage']}\nQuestion: {doc['question']}\nAnswer:"
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc['passage'] + " " + doc['question']
def doc_to_target(self, doc):
return " " + ", ".join(doc["answers"][0])
...
@@ -70,6 +70,12 @@ class CoLA(Task):
def doc_to_text(self, doc):
return "{}\nQuestion: Does this sentence make sense?\nAnswer:".format(doc["sentence"])
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["sentence"]
def doc_to_target(self, doc):
return " {}".format({1: "yes", 0: "no"}[doc["label"]])
...
@@ -61,6 +61,12 @@ class HeadQABase(MultipleChoiceTask):
def doc_to_text(self, doc):
return doc["query"]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["query"]
class HeadQAEn(HeadQABase):
DATASET_NAME = "en"
@@ -76,4 +82,4 @@ class HeadQAEsDeprecated(HeadQABase):
def __init__(self):
super().__init__()
print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
\ No newline at end of file