Unverified Commit e00d682f authored by Jonathan Tow and committed by GitHub

Merge pull request #261 from EleutherAI/researcher2

Update CLI options and introduce decontamination
parents eb8163e9 ab6883b1
@@ -26,7 +26,7 @@ To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you ca
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks lambada,hellaswag
```
(This uses gpt2-117M by default, as per the HF defaults; use `--model_args` to specify other GPT-2 sizes.)
@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
--device 0 \
--tasks lambada,hellaswag
```
@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--device 0 \
--tasks lambada,hellaswag \
--num_fewshot 2
@@ -392,6 +392,21 @@ python write_out.py \
This will write out one text file for each task.
### Test Set Decontamination
For more details see the [decontamination guide](./docs/decontamination.md).
The directory provided with the `--decontamination_ngrams_path` argument should contain
the n-gram files and `info.json`. See the guide above for n-gram generation for the Pile; this could be adapted to other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure
There are two major components of the library:
...
# Decontamination
## Usage
Simply add the `--decontamination_ngrams_path` argument when running main.py. The provided directory should contain
the n-gram files and `info.json` produced in "Pile Ngram Generation" further down.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
## Background
Downstream evaluations test model generalization, and are less useful when test-set data also exists in the training set (leakage/contamination).
As a first step this is addressed by filtering the training set; however, benchmarks often don't exist or weren't considered prior to model training. In that case it is useful to measure the impact of test-set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on the dataset; we just use 13 for simplicity.
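As a toy illustration of this test (not the harness implementation, which normalizes text with the `Janitor` helper and scans pre-sorted n-gram files on disk), a test document is flagged as soon as a single 13-gram is shared with any training document:

```python
# Toy sketch of the contamination test described above; illustrative only.
def word_13grams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_doc, training_docs, n=13):
    test_ngrams = word_13grams(test_doc, n)
    # A single shared n-gram with any training document marks the test doc as contaminated.
    return any(test_ngrams & word_13grams(train_doc, n) for train_doc in training_docs)
```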
## Implementation
Contamination detection can be found in `lm_eval/decontamination/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.
`decontaminate.py` does the following (a condensed sketch follows the list below):
1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
2. Scan through sorted files containing training set n-grams.
3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
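A condensed sketch of these three steps, assuming the training-set n-grams are already available as plain "ngram document_id" lines (the real `decontaminate.py` additionally caches lookups per task/set, merges lookups across tasks, and streams the zstd-compressed buckets):

```python
import collections

def find_contaminated_docs(eval_docs, training_ngram_lines, n=13):
    # eval_docs: {doc_id: normalized document text}
    # training_ngram_lines: iterable of "ngram document_id" lines from the
    # sorted training-set n-gram files.

    # 1. Build a lookup of every evaluation n-gram -> ids of the documents containing it.
    lookup = collections.defaultdict(set)
    for doc_id, text in eval_docs.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            lookup[" ".join(words[i:i + n])].add(doc_id)

    # 2./3. Scan the training-set n-grams; any hit marks the matching evaluation docs.
    contaminated = set()
    for line in training_ngram_lines:
        ngram, _train_doc_id = line.rsplit(" ", 1)
        if ngram in lookup:
            contaminated |= lookup.pop(ngram)  # no need to match this ngram again
    return contaminated
```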
"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.
This is disabled by default for new tasks, to support decontamination on a task override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
## Pile Ngram Generation
The relevant scripts can be found in `scripts/clean_training_data`, and they also import from
`lm_eval/decontamination/`.
1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
2. pip install -r requirements.txt
3. Download The Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
4. Place pile files in "pile" directory under "lm-evaluation-harness" (or create a symlink)
5. Run generate_13_grams.
```bash
export PYTHONHASHSEED=0
python -m scripts.clean_training_data.generate_13_grams \
-dir path/to/working/directory \
-n 13 \
-buckets 500
```
This took approximately 4 days for us. We had the time to wait, but the work could be scaled out by running partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of bucket hashing.
6. Sort the generated 13-grams.
```bash
python -m scripts.clean_training_data.sort_13_gram_buckets \
-dir path/to/working/directory/output
```
This took approximately 5 days for us. You could speed this up by spreading the files across different machines, running the sort script on each, and then gathering the sorted files back together.
7. Compress the sorted 13-gram files and package them together with `info.json`.
This step only takes a few hours.
```bash
python -m scripts.clean_training_data.compress_and_package \
-dir path/to/working/directory \
-output path/to/final/directory \
-procs 8
```
Congratulations, the final directory can now be passed to lm-evaluation-harness with the `--decontamination_ngrams_path` argument.
@@ -151,6 +151,13 @@ def doc_to_target(self, doc):
Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
### Decontamination
For background on decontamination, please see the [decontamination guide](./decontamination.md).
If you wish to support decontamination studies for your task, simply override the `should_decontaminate` method and return `True`.
You also need to override `doc_to_decontamination_query` and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request; we leave that up to the implementor. For a multiple-choice evaluation you could, for example, just return the question, as in the sketch below.
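For example, here is a minimal sketch for a hypothetical multiple-choice task whose documents carry a "question" field (the class and field name are illustrative, not from the library; the usual dataset plumbing is omitted):

```python
from lm_eval.base import MultipleChoiceTask

class MyMultipleChoiceTask(MultipleChoiceTask):
    # ... dataset plumbing (has_training_docs, training_docs, doc_to_text, etc.) omitted ...

    def should_decontaminate(self):
        return True

    def doc_to_decontamination_query(self, doc):
        # Only the question text is compared against the training set;
        # returning the full prompt including answer choices would also work.
        return doc["question"]
```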
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
...
@@ -415,6 +415,10 @@ class Task(abc.ABC):
download_mode=download_mode
)
def should_decontaminate(self):
"""Whether this task supports decontamination against model training set."""
return False
@abstractmethod
def has_training_docs(self):
"""Whether the task has a training set"""
@@ -468,6 +472,10 @@ class Task(abc.ABC):
return rnd.sample(self._training_docs, k)
def doc_to_decontamination_query(self, doc):
raise NotImplementedError("Override doc_to_decontamination_query with a document-specific decontamination query.")
@abstractmethod
def doc_to_text(self, doc):
pass
@@ -625,6 +633,10 @@ class MultipleChoiceTask(Task):
class PerplexityTask(Task, abc.ABC):
def should_decontaminate(self):
"""Whether this task supports decontamination against model training set."""
return True
def has_training_docs(self):
return False
@@ -653,6 +665,9 @@ class PerplexityTask(Task, abc.ABC):
"bits_per_byte": False,
}
def doc_to_decontamination_query(self, doc):
return doc
def doc_to_text(self, doc):
return ""
...
@@ -4,6 +4,9 @@ import json
import jsonlines
import io
import datetime
import mmap
import tqdm
from pathlib import Path
def json_serial(obj):
"""JSON serializer for objects not serializable by default json code"""
@@ -59,16 +62,19 @@ class Reader:
else:
yield text
# Simple text reader and writer with same interface as above
class TextArchive:
def __init__(self, file_path, mode="rb+"):
self.file_path = file_path
dir_name = os.path.dirname(file_path)
if dir_name:
os.makedirs(dir_name, exist_ok=True)
if not os.path.exists(file_path):
Path(file_path).touch()
self.fh = open(self.file_path, mode)
def add_data(self, data):
self.fh.write(data.encode('UTF-8') + b'\n')
def commit(self):
@@ -79,12 +85,63 @@ class TextReader:
def __init__(self, file_path):
self.file_path = file_path
# Optimized mmap read with infrequent tqdm updates to maintain speed
# Tested up to 250MB/s.
def read_tqdm(self, update_frequency=10000):
current_file_position = 0
line_counter = 0
with open(self.file_path, 'r') as fh, \
tqdm.tqdm(total=os.path.getsize(self.file_path), dynamic_ncols=True,
unit="byte", unit_scale=1) as progress:
with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
for line in iter(mmap_obj.readline, b""):
line = line.decode("utf-8")
line_counter += 1
if line_counter == update_frequency:
new_file_pos = mmap_obj.tell()
bytes_read = new_file_pos - current_file_position
current_file_position = new_file_pos
progress.update(bytes_read)
line_counter = 0
yield line[:-1]
def read_and_tell(self):
current_file_position = 0
with open(self.file_path, 'r', encoding="utf8") as fh:
with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
for line in iter(mmap_obj.readline, b""):
line = line.decode("utf-8")
new_file_pos = mmap_obj.tell()
raw_bytes_read = new_file_pos - current_file_position
current_file_position = new_file_pos
yield line[:-1], raw_bytes_read
def read(self):
with open(self.file_path, 'r', encoding="utf8") as fh:
with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
for line in iter(mmap_obj.readline, b""):
line = line.decode("utf-8")
yield line[:-1]
def read_slow(self):
with open(self.file_path, 'r', encoding="utf8") as fh:
while True:
line = fh.readline()
if line == -1 or line == "":
break
else:
yield line[:-1]
# Optimized for speed. Decompresses the archive in shell before
# using the mmap'd TextReader.
class ZStdTextReader:
def __init__(self, file):
self.file = file
def read_tqdm(self):
decompressed_file = self.file[:-4]
print("Decompressing file, please wait...")
os.system(f"zstd -d {self.file}") # linux decompress is faster
reader = TextReader(decompressed_file)
yield from reader.read_tqdm()
os.remove(decompressed_file)
\ No newline at end of file
import time
import random
import pickle
import json
import glob
import os
import collections
from .janitor import Janitor, word_ngrams
from .archiver import ZStdTextReader
# Was used for testing the evaluator decoupled from the full logic below
def get_train_overlap_stub(docs, ngrams_path, ngrams_n_size):
simulated_overlap = 0.1
contaminated = int(len(docs) * simulated_overlap)
return random.sample(range(len(docs)), contaminated)
# Returns a dictionary containing all overlapping documents in each
# task. In the standard use case, an overlap occurs when any of the 13-grams
# found in the task document exist in the training set documents.
#
# To generate 13-grams for the pile see scripts/clean_training_data. The final output of these
# scripts are an info.json file containing the n_gram_size (13) and a bunch of "ngrams_{x}.bkt.txt.sorted.zst"
# files. These should exist in the "ngrams_path" provided to this function.
# Algorithm:
# 1. Build lookups for each dataset {ngram: list(document_ids)}
# 2. Merge into an overall lookup {ngram: [(task_name, task_set, doc_ids),]}
# 3. Full scan the 13-grams from the training set against the merged lookup,
# saving matches in the "duplicates" dictionary {(task_name, task_set): set(doc_ids)}
# 4. Strip the task_set from the dictionary keys and return
#
# We cache the task+set lookups as well as the overlaps.
def get_train_overlap(docs_by_task_set, ngrams_path, limit):
# return get_train_overlap_stub(docs, ngrams_path, ngrams_n_size)
info_dict_path = os.path.join(ngrams_path, "info.json")
info_dict = json.load(open(info_dict_path, "r"))
ngrams_n_size = info_dict["ngram_size"]
janitor = Janitor()
# Build lookup for each dataset first in case we use different task combinations later
print("Building Lookups...")
start = time.perf_counter()
def get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit):
return f"data/{task_name}/{task_set}_{ngrams_n_size}grams_limit{limit}.overlaps"
lookups = {}
duplicates = {} # (task_name, task_set): set(doc_ids)}
sets_to_decontaminate = len(docs_by_task_set.keys())
for (task_name, task_set), docs in docs_by_task_set.items():
if not os.path.exists(f"data/{task_name}"):
os.mkdir(f"data/{task_name}")
# Check if we've decontaminated this combination before
overlaps_dump_path = get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit)
if os.path.exists(overlaps_dump_path):
duplicates[(task_name, task_set)] = pickle.load(open(overlaps_dump_path, "rb"))
sets_to_decontaminate -= 1
continue
else:
duplicates[(task_name, task_set)] = set()
# Build/load the task lookup {ngram: set(documents)}.
task_set_lookup_path = f"data/{task_name}/{task_set}_{ngrams_n_size}grams_limit{limit}.lookup"
if os.path.exists(task_set_lookup_path):
print(f"{task_set_lookup_path} available, loading...")
lookups[(task_name, task_set)] = pickle.load(open(task_set_lookup_path, "rb"))
else:
print(f"{task_set_lookup_path} not available, building...")
lookup = collections.defaultdict(set)
for doc_id, document in enumerate(docs):
ngrams = word_ngrams(janitor.normalize_string(document), ngrams_n_size)
for ngram in ngrams:
lookup[ngram].add(doc_id)
pickle.dump(lookup, open(task_set_lookup_path,"wb"))
lookups[(task_name, task_set)] = lookup
elapsed = time.perf_counter() - start
print(f"Building lookups took {elapsed:0.5f} seconds.")
matched_ngrams = []
if sets_to_decontaminate > 0:
print("Merging lookups...")
start = time.perf_counter()
merged_lookup = collections.defaultdict(list)
for (task_name, task_set), lookup in lookups.items():
for ngram, doc_ids in lookup.items():
merged_lookup[ngram].append((task_name, task_set, doc_ids))
elapsed = time.perf_counter() - start
print(f"Merging lookups took {elapsed:0.5f} seconds.")
print(f"{ngrams_n_size} grams files found in {ngrams_path}:")
files = glob.glob(os.path.join(ngrams_path, f"*.sorted.zst"))
print(files)
for file in files:
start = time.perf_counter()
print(f"Scanning {file}")
reader = ZStdTextReader(file)
total_ngrams = 0
unique_ngrams = 0
matching_unique = 0
non_matching_unique = 0
current_ngram = ""
for line in reader.read_tqdm(): # Scan training set ngrams file
total_ngrams += 1
[ngram, document_id] = line.rsplit(" ", 1)
if ngram != current_ngram: # Only need to match the ngram once in training set
unique_ngrams += 1
current_ngram = ngram
if ngram in merged_lookup:
matched_ngrams.append(ngram) # For logging
matching_unique += 1
for task_name, task_set, doc_ids in merged_lookup[ngram]:
task_doc_set = duplicates[(task_name, task_set)]
for doc_id in doc_ids: # Record contamination across all relevant task/set combos
task_doc_set.add(doc_id)
del merged_lookup[ngram] # No point matching again
else:
non_matching_unique += 1
print(f"Total Ngrams: {total_ngrams}")
print(f"Unique Ngrams: {unique_ngrams}")
print(f"Unique Matching: {matching_unique}")
print(f"Unique Non Matching: {non_matching_unique}")
print("Matched ngrams:")
for ngram in matched_ngrams:
print(ngram)
elapsed = time.perf_counter() - start
print(f"Read took {elapsed:0.5f} seconds.")
print(f"Speed: {(os.path.getsize(file)/1000000.0)/elapsed}MB/second")
print(duplicates)
# Dump overlaps separately
for (task_name, task_set), doc_ids in duplicates.items():
overlaps_dump_path = get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit)
pickle.dump(doc_ids, open(overlaps_dump_path,"wb"))
# Strip task set and return
return {task_name: doc_ids for (task_name, task_set), doc_ids in duplicates.items()}
@@ -6,15 +6,18 @@ import lm_eval.metrics
import lm_eval.models
import lm_eval.tasks
import lm_eval.base
import lm_eval.decontamination
import numpy as np
from lm_eval.utils import positional_deprecated, run_task_tests
from lm_eval.decontamination.decontaminate import get_train_overlap
@positional_deprecated
def simple_evaluate(model, model_args=None, tasks=[],
num_fewshot=0, batch_size=None, device=None,
no_cache=False, limit=None, bootstrap_iters=100000,
description_dict=None, check_integrity=False,
decontamination_ngrams_path=None):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM] :param model: Union[str, LM]
@@ -72,7 +75,8 @@ def simple_evaluate(model, model_args=None, tasks=[],
task_dict=task_dict,
num_fewshot=num_fewshot,
limit=limit,
description_dict=description_dict,
decontamination_ngrams_path=decontamination_ngrams_path,
)
# add info about the model and few shot config
@@ -90,9 +94,11 @@ def simple_evaluate(model, model_args=None, tasks=[],
return results
decontaminate_suffix = "_decontaminate"
@positional_deprecated
def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None,
decontamination_ngrams_path=None):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
:param lm: obj :param lm: obj
...@@ -120,6 +126,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, ...@@ -120,6 +126,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# nudge people to not specify it at all # nudge people to not specify it at all
print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")
decontaminate = decontamination_ngrams_path is not None
task_dict_items = [
(name, task)
for name, task in task_dict.items()
@@ -132,6 +140,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
requests = collections.defaultdict(list)
requests_origin = collections.defaultdict(list)
overlaps = collections.defaultdict(list) # {task_name: contaminated_docs}
# If we ever run into issues where the eval tasks don't fit in memory and we can't afford a machine with bigger
# memory, we can always modify this plumbing to support that, but I didn't want to include it just yet because
# over-engineering is bad (or we could make it write the requests to disk and then read them back out again
@@ -140,6 +150,8 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# TODO: we need unit tests & sanity checks or something to ensure that the return of `validation_docs` is stable
docs = {}
docs_for_decontamination = collections.defaultdict(list)
# get lists of each type of request
for task_name, task in task_dict_items:
versions[task_name] = task.VERSION
@@ -147,7 +159,9 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# TODO: the test-fallback-to-val system isn't final, we should revisit it at some point
if task.has_test_docs():
task_doc_func = task.test_docs
task_set = "test" # Required for caching in the decontamination
elif task.has_validation_docs():
task_set = "val" # Required for caching in the decontamination
task_doc_func = task.validation_docs
else:
raise RuntimeError("Task has neither test_docs nor validation_docs")
@@ -161,6 +175,10 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
description = description_dict[task_name] if description_dict and task_name in description_dict else ""
for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
if decontaminate and task.should_decontaminate():
docs_for_decontamination[(task_name, task_set)].append(task.doc_to_decontamination_query(doc))
docs[(task_name, doc_id)] = doc
ctx = task.fewshot_context(
doc=doc,
@@ -177,6 +195,11 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
# doc_id: unique id that we can get back to a doc using `docs`
requests_origin[req.request_type].append((i, task_name, doc, doc_id))
# Compare all tasks/sets at once to ensure a single training set scan
if decontaminate:
print("Finding train/test overlap, please wait...")
overlaps = get_train_overlap(docs_for_decontamination, decontamination_ngrams_path, limit)
# all responses for each (task, doc)
process_res_queue = collections.defaultdict(list)
@@ -207,18 +230,28 @@ def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None,
metrics = task.process_results(doc, requests)
for metric, value in metrics.items():
vals[(task_name, metric)].append(value)
# Re-use the evaluation for the decontaminated set by just ignoring the overlaps
if decontaminate and task_name in overlaps:
if doc_id not in overlaps[task_name]:
vals[(task_name, metric + decontaminate_suffix)].append(value)
# aggregate results
for (task_name, metric), items in vals.items():
task = task_dict[task_name]
real_metric = metric # key when looking up the metric with task.aggregation
if metric.endswith(decontaminate_suffix):
real_metric = metric.replace(decontaminate_suffix, "") # decontaminated still uses the same metric
results[task_name][metric] = task.aggregation()[real_metric](items)
# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
# so we run them less iterations. still looking for a cleaner way to do this
stderr = lm_eval.metrics.stderr_for_metric(
metric=task.aggregation()[real_metric],
bootstrap_iters=min(bootstrap_iters, 1000) if metric in ["bleu", "chrf", "ter"] else bootstrap_iters,
)
if stderr is not None:
results[task_name][metric + "_stderr"] = stderr(items)
...
@@ -12,9 +12,14 @@ class HFLM(BaseLM):
assert isinstance(pretrained, str)
assert isinstance(batch_size, int)
if device:
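# An integer device (e.g. "--device 0" on the command line) is passed to torch.device as a CUDA device ordinal.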
if device not in ["cuda", "cpu"]:
device = int(device)
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specificed")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# TODO: update this to be less of a hack once subfolder is fixed in HF
...
@@ -66,6 +66,12 @@ class ANLIBase(Task):
# want to do it exactly as OA did?
return doc['premise'] + '\nQuestion: ' + doc['hypothesis'] + ' True, False, or Neither?\nAnswer:'
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["premise"]
def doc_to_target(self, doc):
# True = entailment
# False = contradiction
...
@@ -67,6 +67,12 @@ class ARCEasy(MultipleChoiceTask):
def doc_to_text(self, doc):
return doc["query"]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["query"]
class ARCChallenge(ARCEasy):
DATASET_PATH = "ai2_arc"
...
@@ -53,6 +53,12 @@ class Arithmetic(Task):
def doc_to_text(self, doc):
return doc["context"]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["context"]
def doc_to_target(self, doc):
return doc["completion"]
...
@@ -67,6 +67,12 @@ class Asdiv(Task):
# TODO: add solution-type
return doc['body'] + '\n' + 'Question:' + doc['question'] + '\n' + 'Answer:'
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc['body'] + " " + doc['question']
def doc_to_target(self, doc):
# TODO: add formula
...
@@ -68,6 +68,12 @@ class BlimpTask(Task):
# this method is invoked by tests only
return ""
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["sentence_good"] + " " + doc["sentence_bad"]
def doc_to_target(self, doc):
# this method is invoked by tests only
return ""
...
@@ -75,6 +75,13 @@ class CBTBase(Task):
text = "Passage: " + passage + "\nQuestion: " + doc["question"]
return self.detokenize(text)
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
passage = " ".join(doc["sentences"])
return passage
def doc_to_target(self, doc):
return ""
...
@@ -61,6 +61,12 @@ class CoQA(Task):
doc_text += question + answer
return doc_text
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["story"] + " " + "\n".join(doc["questions"]["input_text"])
@classmethod
def get_answers(cls, doc, turn_id):
# Returns unique answers and valid alternatives (Some questions in CoQA have multiple valid answers).
...
@@ -107,6 +107,12 @@ class DROP(Task):
def doc_to_text(self, doc):
return f"Passage: {doc['passage']}\nQuestion: {doc['question']}\nAnswer:"
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc['passage'] + " " + doc['question']
def doc_to_target(self, doc):
return " " + ", ".join(doc["answers"][0])
...
@@ -70,6 +70,12 @@ class CoLA(Task):
def doc_to_text(self, doc):
return "{}\nQuestion: Does this sentence make sense?\nAnswer:".format(doc["sentence"])
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["sentence"]
def doc_to_target(self, doc):
return " {}".format({1: "yes", 0: "no"}[doc["label"]])
...
@@ -61,6 +61,12 @@ class HeadQABase(MultipleChoiceTask):
def doc_to_text(self, doc):
return doc["query"]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["query"]
class HeadQAEn(HeadQABase):
DATASET_NAME = "en"
@@ -76,4 +82,4 @@ class HeadQAEsDeprecated(HeadQABase):
def __init__(self):
super().__init__()
print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
\ No newline at end of file