Commit 4411e788 authored by researcher2's avatar researcher2

More Changes For PR

Add documentation for decontamination.

Improvements to 13-gram generation.
parent 3c0fa5a6
README.md
@@ -26,7 +26,7 @@ To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag)
```bash
python main.py \
--model gpt2 \
- --device cuda:0 \
+ --device 0 \
--tasks lambada,hellaswag
```
(This uses gpt2-117M by default, as per the HF defaults; use `--model_args` to specify other GPT-2 sizes.)
@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model_args` flag.
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
- --device cuda:0 \
+ --device 0 \
--tasks lambada,hellaswag
```
@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model_args` flag.
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
- --device cuda:0 \
+ --device 0 \
--tasks lambada,hellaswag \
--num_fewshot 2
```
@@ -392,6 +392,21 @@ python write_out.py \
This will write out one text file for each task.
### Test Set Decontamination

For more details see the [decontamination guide](./docs/decontamination.md).

The directory provided via the `--decontamination_ngrams_path` argument should contain the n-gram files and `info.json`. See the guide above for n-gram generation for the Pile; the process can be adapted to other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure
There are two major components of the library:
docs/decontamination.md
# Decontamination

## Usage

Simply pass `--decontamination_ngrams_path` when running `main.py`. The provided directory should contain the n-gram files and `info.json` produced in "Pile Ngram Generation" further down.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
## Background

Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set (leakage/contamination).

As a first step, this is addressed by filtering the training set; however, benchmarks often don't exist or weren't considered prior to model training. In that case it is useful to measure the impact of test set leakage by detecting the uncontaminated test examples and producing a clean version of the benchmark from them.

The basis for our decontamination procedure is Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on the dataset; we use 13 for simplicity.
## Implementation

Contamination detection can be found in `lm_eval/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.

`decontaminate.py` does the following (a minimal sketch of the idea follows the list):

1. Build dictionaries of all n-grams and their corresponding evaluation/document ids.
2. Scan through the sorted files containing the training set's n-grams.
3. If a match is found, mark the corresponding evaluation/document combinations as contaminated.
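A minimal sketch of this matching procedure, with illustrative names rather than the harness's actual internals:

```python
from collections import defaultdict

def make_ngrams(tokens, n=13):
    # Yield every contiguous n-token window of a document.
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])

def find_contaminated(eval_docs, training_ngram_lines, n=13):
    # Step 1: map each evaluation n-gram to the doc ids that contain it.
    ngram_to_docs = defaultdict(set)
    for doc_id, text in eval_docs.items():
        for ngram in make_ngrams(text.split(), n):
            ngram_to_docs[ngram].add(doc_id)

    # Steps 2-3: scan the training set's n-grams; any hit marks the
    # corresponding evaluation docs as contaminated.
    contaminated = set()
    for line in training_ngram_lines:
        ngram = line.strip()
        if ngram in ngram_to_docs:
            contaminated.update(ngram_to_docs[ngram])
    return contaminated
```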
"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.
This is disabled by default for new tasks, to support decontamination on a task override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
## Pile Ngram Generation

The relevant scripts can be found in `scripts/clean_training_data`, which also imports from `lm_eval/decontamination/`.

1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
2. pip install -r requirements.txt
3. Download the Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
4. Place the Pile files in a "pile" directory under "lm-evaluation-harness" (or create a symlink)
5. Run generate_13_grams:
```bash
export PYTHONHASHSEED=0
python -m scripts.clean_training_data.generate_13_grams \
    -dir path/to/working/directory \
    -n 13 \
    -buckets 500
```
This took approximately 4 days for us. We had the time to wait, but the job could be scaled out by running partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of the bucket hashing.
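For intuition, bucket assignment boils down to hashing each 13-gram modulo the bucket count, which is why PYTHONHASHSEED must be pinned. A sketch, not the script's exact code:

```python
# Without a fixed PYTHONHASHSEED, Python salts hash() per process, so
# the same 13-gram could land in a different bucket on each run.
NUM_BUCKETS = 500  # matches the -buckets argument above

def bucket_for(ngram: str) -> int:
    return hash(ngram) % NUM_BUCKETS
```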
6. Sort the generated 13-grams:
```bash
python -m scripts.clean_training_data.sort_13_gram_buckets \
    -dir path/to/working/directory/output
```
This took approximately 5 days for us. You could speed it up by spreading the files across different machines and running the sort script on each before gathering the results back together.
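To parallelize on a single machine instead, something like the following sketch would work; the bundled script sorts the buckets serially, and GNU sort already uses multiple threads per file:

```python
# Illustrative only; not part of the harness's scripts.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def sort_bucket(path: str) -> None:
    # Shell out to GNU sort, writing <bucket>.sorted next to the input.
    subprocess.run(f"sort {path} > {path}.sorted", shell=True, check=True)

bucket_files = glob.glob("path/to/working/directory/output/*.bkt.txt")
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(sort_bucket, bucket_files))
```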
7. Compress the sorted 13-gram files and place them together with `info.json`. This step only takes a few hours.
```bash
python -m scripts.clean_training_data.compress_and_package \
    -dir path/to/working/directory \
    -output path/to/final/directory \
    -procs 8
```
Congratulations, the final directory can now be passed to lm-evaluation-harness with the `--decontamination_ngrams_path` argument.
docs/task_guide.md
@@ -142,6 +142,13 @@ def doc_to_target(self, doc):
Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
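For intuition, a k-shot context is assembled roughly like this simplified sketch (the harness's actual fewshot logic also handles separators and task descriptions):

```python
def build_context(task, fewshot_docs, eval_doc):
    # Each shot is the task's text followed by its gold target.
    shots = "".join(
        task.doc_to_text(d) + task.doc_to_target(d) for d in fewshot_docs
    )
    # The evaluation document contributes only its text; the model
    # must produce the target.
    return shots + task.doc_to_text(eval_doc)
```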
### Decontamination

For background on decontamination, please see [the decontamination guide](./decontamination.md).

If you wish to support decontamination studies for your task, simply override the `should_decontaminate` method and return `True`.

You also need to override `doc_to_decontamination_query` and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request; we leave this up to the implementor. For a multiple-choice evaluation you could, for example, just return the question, as in the sketch below.
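A hypothetical multiple-choice task showing the two overrides ("question" is an illustrative field name):

```python
class MyMultipleChoiceTask(Task):
    def should_decontaminate(self):
        return True

    def doc_to_decontamination_query(self, doc):
        # Compare only the question text against the training set.
        return doc["question"]
```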
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
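The registration itself is just a dictionary entry, along these lines ("mytask" and MyTask are placeholders):

```python
# In lm_eval/tasks/__init__.py
from . import mytask

TASK_REGISTRY = {
    # ... existing tasks ...
    "mytask": mytask.MyTask,
}
```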
scripts/clean_training_data/compress_and_package.py
import glob
import argparse
import os
import subprocess
import shutil
import logging

from tqdm import tqdm
from tqdm_multiprocess import TqdmMultiProcessPool
from tqdm_multiprocess.logger import setup_logger_tqdm

logger = logging.getLogger(__name__)


def process_task(working_directory, output_directory, bucket_file_path, tqdm_func, global_tqdm):
    # Compress a single sorted bucket with zstd, move the .zst result to
    # the output directory, and delete the uncompressed original.
    command = f"zstd {bucket_file_path}"
    logger.info(command)
    subprocess.call(command, shell=True)

    compressed_file = bucket_file_path + ".zst"
    if output_directory:
        shutil.move(compressed_file, output_directory)

    os.remove(bucket_file_path)
    global_tqdm.update()


def compress_and_move(working_directory, output_directory, process_count):
    os.makedirs(output_directory, exist_ok=True)

    original_info_file_path = os.path.join(working_directory, "info.json")
    assert os.path.exists(original_info_file_path)

    # Queue one compression task per sorted bucket file.
    tasks = []
    bucket_file_paths = glob.glob(os.path.join(working_directory, "output", "*.bkt.txt.sorted"))
    for bucket_file_path in bucket_file_paths:
        task = (process_task, (working_directory, output_directory, bucket_file_path))
        tasks.append(task)

    pool = TqdmMultiProcessPool(process_count)
    on_done = lambda _: None
    on_error = lambda _: None
    global_progress = tqdm(total=len(bucket_file_paths), dynamic_ncols=True, unit="file")
    _ = pool.map(global_progress, tasks, on_error, on_done)

    # Ship info.json alongside the compressed buckets.
    shutil.copy(original_info_file_path, os.path.join(output_directory, "info.json"))


parser = argparse.ArgumentParser(description="Compress and package sorted 13-gram buckets.")
parser.add_argument("-dir", "--working_directory", required=True)
parser.add_argument("-output", "--output_directory", required=True)
parser.add_argument("-procs", "--process_count", type=int, default=8)

if __name__ == "__main__":
    version = 1.00
    print(f"Running version {version}")

    logfile_path = "compress_and_package.log"
    setup_logger_tqdm(logfile_path)

    args = parser.parse_args()
    compress_and_move(args.working_directory, args.output_directory, args.process_count)
"""
Iteratively runs gnu sort on each bucket, gnu handles the multiprocessing.
Iteratively runs gnu sort on each bucket, which does automatic the multiprocessing up to 8.
Arguments
---------
@@ -11,10 +11,8 @@ Arguments
import glob
import argparse
import os
-from pathlib import Path
import signal
from signal import SIGINT
import re
import subprocess
from tqdm import tqdm
@@ -32,12 +30,6 @@ def sort_13_gram_buckets(working_directory):
    bucket_file_paths = glob.glob(os.path.join(working_directory, "*.bkt.txt"))
    for bucket_file_path in tqdm(bucket_file_paths, dynamic_ncols=True):
        bucket_id = re.sub(r"\D", "", os.path.basename(bucket_file_path))
-        done_file = os.path.join(working_directory, f"ngram_bucket_sorting_{bucket_id}.done")
-        if os.path.exists(done_file):
-            logger.info(f"bucket {bucket_id} already processed, skipping")
-            return
        sorted_file_path = bucket_file_path + ".sorted"
        command = f"sort {bucket_file_path} > {sorted_file_path}"
        logger.info(command)
@@ -46,7 +38,6 @@ def sort_13_gram_buckets(working_directory):
        if terminate:
            return
-        Path(done_file).touch()
        os.remove(bucket_file_path)
parser = argparse.ArgumentParser(description='sort 13gram buckets')
@@ -54,6 +45,9 @@ parser.add_argument("-dir", "--working_directory", default="")
if __name__ == '__main__':
    version = 1.00
    print(f"Running version {version}")

    # Handle sigint (ctrl-c) cleanly
    previous_signal_int = signal.signal(SIGINT, handler)