Commit 4411e788 authored by researcher2's avatar researcher2

More Changes For PR

Add documentation for decontamination.

Improvements to 13-gram generation.
parent 3c0fa5a6
README.md
@@ -26,7 +26,7 @@ To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag)
```bash
python main.py \
--model gpt2 \
- --device cuda:0 \
+ --device 0 \
--tasks lambada,hellaswag
```
(This uses gpt2-117M by default, as per the HF defaults; use `--model_args` to specify other GPT-2 sizes.)
@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model_args` flag.
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
- --device cuda:0 \
+ --device 0 \
--tasks lambada,hellaswag
```
@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model_args` flag.
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
- --device cuda:0 \
+ --device 0 \
--tasks lambada,hellaswag \
--num_fewshot 2
```
@@ -392,6 +392,21 @@ python write_out.py \
This will write out one text file for each task.
### Test Set Decontamination

For more details see the [decontamination guide](./docs/decontamination.md).

The directory provided via the `--decontamination_ngrams_path` argument should contain the n-gram files and `info.json`. See the guide above for n-gram generation for the Pile; the process can be adapted to other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure
There are two major components of the library:
docs/decontamination.md
# Decontamination

## Usage

Simply pass `--decontamination_ngrams_path` when running `main.py`. The provided directory should contain the n-gram files and `info.json` produced in "Pile Ngram Generation" further down.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
## Background

Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set (leakage/contamination).

As a first step, this is addressed by filtering the training set; however, benchmarks often don't exist or weren't considered prior to model training. In that case it is useful to measure the impact of test set leakage by detecting the uncontaminated test examples and producing a clean version of the benchmark from them.

The basis for our decontamination procedure is Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on the dataset; we use 13 for simplicity.
## Implementation

Contamination detection can be found in `lm_eval/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.

`decontaminate.py` does the following (a minimal sketch of the idea follows the list):

1. Build dictionaries of all n-grams and their corresponding evaluation/document ids.
2. Scan through the sorted files containing the training set's n-grams.
3. If a match is found, mark the corresponding evaluation/document combinations as contaminated.
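A minimal sketch of this matching procedure, with illustrative names rather than the harness's actual internals:

```python
from collections import defaultdict

def make_ngrams(tokens, n=13):
    # Yield every contiguous n-token window of a document.
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])

def find_contaminated(eval_docs, training_ngram_lines, n=13):
    # Step 1: map each evaluation n-gram to the doc ids that contain it.
    ngram_to_docs = defaultdict(set)
    for doc_id, text in eval_docs.items():
        for ngram in make_ngrams(text.split(), n):
            ngram_to_docs[ngram].add(doc_id)

    # Steps 2-3: scan the training set's n-grams; any hit marks the
    # corresponding evaluation docs as contaminated.
    contaminated = set()
    for line in training_ngram_lines:
        ngram = line.strip()
        if ngram in ngram_to_docs:
            contaminated.update(ngram_to_docs[ngram])
    return contaminated
```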
"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.
This is disabled by default for new tasks, to support decontamination on a task override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
## Pile Ngram Generation

The relevant scripts can be found in `scripts/clean_training_data`, which also imports from `lm_eval/decontamination/`.

1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
2. pip install -r requirements.txt
3. Download the Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
4. Place the Pile files in a "pile" directory under "lm-evaluation-harness" (or create a symlink)
5. Run generate_13_grams:
```bash
export PYTHONHASHSEED=0
python -m scripts.clean_training_data.generate_13_grams \
    -dir path/to/working/directory \
    -n 13 \
    -buckets 500
```
This took approximately 4 days for us. We had the time to wait, but the job could be scaled out by running partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of the bucket hashing.
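For intuition, bucket assignment boils down to hashing each 13-gram modulo the bucket count, which is why PYTHONHASHSEED must be pinned. A sketch, not the script's exact code:

```python
# Without a fixed PYTHONHASHSEED, Python salts hash() per process, so
# the same 13-gram could land in a different bucket on each run.
NUM_BUCKETS = 500  # matches the -buckets argument above

def bucket_for(ngram: str) -> int:
    return hash(ngram) % NUM_BUCKETS
```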
6. Sort the generated 13-grams:
```bash
python -m scripts.clean_training_data.sort_13_gram_buckets \
    -dir path/to/working/directory/output
```
This took approximately 5 days for us. You could speed it up by spreading the files across different machines and running the sort script on each before gathering the results back together.
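To parallelize on a single machine instead, something like the following sketch would work; the bundled script sorts the buckets serially, and GNU sort already uses multiple threads per file:

```python
# Illustrative only; not part of the harness's scripts.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def sort_bucket(path: str) -> None:
    # Shell out to GNU sort, writing <bucket>.sorted next to the input.
    subprocess.run(f"sort {path} > {path}.sorted", shell=True, check=True)

bucket_files = glob.glob("path/to/working/directory/output/*.bkt.txt")
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(sort_bucket, bucket_files))
```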
7. Compress the sorted 13-gram files and place them together with `info.json`. This step only takes a few hours.
```bash
python -m scripts.clean_training_data.compress_and_package \
    -dir path/to/working/directory \
    -output path/to/final/directory \
    -procs 8
```
Congratulations, the final directory can now be passed to lm-evaluation-harness with the `--decontamination_ngrams_path` argument.
docs/task_guide.md
@@ -142,6 +142,13 @@ def doc_to_target(self, doc):
Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
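For intuition, a k-shot context is assembled roughly like this simplified sketch (the harness's actual fewshot logic also handles separators and task descriptions):

```python
def build_context(task, fewshot_docs, eval_doc):
    # Each shot is the task's text followed by its gold target.
    shots = "".join(
        task.doc_to_text(d) + task.doc_to_target(d) for d in fewshot_docs
    )
    # The evaluation document contributes only its text; the model
    # must produce the target.
    return shots + task.doc_to_text(eval_doc)
```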
### Decontamination

For background on decontamination, please see [the decontamination guide](./decontamination.md).

If you wish to support decontamination studies for your task, simply override the `should_decontaminate` method and return `True`.

You also need to override `doc_to_decontamination_query` and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request; we leave this up to the implementor. For a multiple-choice evaluation you could, for example, just return the question, as in the sketch below.
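A hypothetical multiple-choice task showing the two overrides ("question" is an illustrative field name):

```python
class MyMultipleChoiceTask(Task):
    def should_decontaminate(self):
        return True

    def doc_to_decontamination_query(self, doc):
        # Compare only the question text against the training set.
        return doc["question"]
```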
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
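The registration itself is just a dictionary entry, along these lines ("mytask" and MyTask are placeholders):

```python
# In lm_eval/tasks/__init__.py
from . import mytask

TASK_REGISTRY = {
    # ... existing tasks ...
    "mytask": mytask.MyTask,
}
```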
scripts/clean_training_data/compress_and_package.py
import glob
import argparse
import os
import subprocess
import shutil
import logging

from tqdm import tqdm
from tqdm_multiprocess import TqdmMultiProcessPool
from tqdm_multiprocess.logger import setup_logger_tqdm

logger = logging.getLogger(__name__)


def process_task(working_directory, output_directory, bucket_file_path, tqdm_func, global_tqdm):
    # Compress a single sorted bucket with zstd, move the .zst result to
    # the output directory, and delete the uncompressed original.
    command = f"zstd {bucket_file_path}"
    logger.info(command)
    subprocess.call(command, shell=True)

    compressed_file = bucket_file_path + ".zst"
    if output_directory:
        shutil.move(compressed_file, output_directory)

    os.remove(bucket_file_path)
    global_tqdm.update()


def compress_and_move(working_directory, output_directory, process_count):
    os.makedirs(output_directory, exist_ok=True)

    original_info_file_path = os.path.join(working_directory, "info.json")
    assert os.path.exists(original_info_file_path)

    # Queue one compression task per sorted bucket file.
    tasks = []
    bucket_file_paths = glob.glob(os.path.join(working_directory, "output", "*.bkt.txt.sorted"))
    for bucket_file_path in bucket_file_paths:
        task = (process_task, (working_directory, output_directory, bucket_file_path))
        tasks.append(task)

    pool = TqdmMultiProcessPool(process_count)
    on_done = lambda _: None
    on_error = lambda _: None
    global_progress = tqdm(total=len(bucket_file_paths), dynamic_ncols=True, unit="file")
    _ = pool.map(global_progress, tasks, on_error, on_done)

    # Ship info.json alongside the compressed buckets.
    shutil.copy(original_info_file_path, os.path.join(output_directory, "info.json"))


parser = argparse.ArgumentParser(description="Compress and package sorted 13-gram buckets.")
parser.add_argument("-dir", "--working_directory", required=True)
parser.add_argument("-output", "--output_directory", required=True)
parser.add_argument("-procs", "--process_count", type=int, default=8)

if __name__ == "__main__":
    version = 1.00
    print(f"Running version {version}")

    logfile_path = "compress_and_package.log"
    setup_logger_tqdm(logfile_path)

    args = parser.parse_args()
    compress_and_move(args.working_directory, args.output_directory, args.process_count)
"""
Iteratively runs gnu sort on each bucket, gnu handles the multiprocessing.
Iteratively runs gnu sort on each bucket, which does automatic the multiprocessing up to 8.
Arguments
---------
@@ -11,10 +11,8 @@ Arguments
import glob
import argparse
import os
-from pathlib import Path
import signal
from signal import SIGINT
import re
import subprocess
from tqdm import tqdm
@@ -32,12 +30,6 @@ def sort_13_gram_buckets(working_directory):
    bucket_file_paths = glob.glob(os.path.join(working_directory, "*.bkt.txt"))
    for bucket_file_path in tqdm(bucket_file_paths, dynamic_ncols=True):
        bucket_id = re.sub(r"\D", "", os.path.basename(bucket_file_path))
-        done_file = os.path.join(working_directory, f"ngram_bucket_sorting_{bucket_id}.done")
-        if os.path.exists(done_file):
-            logger.info(f"bucket {bucket_id} already processed, skipping")
-            return
        sorted_file_path = bucket_file_path + ".sorted"
        command = f"sort {bucket_file_path} > {sorted_file_path}"
        logger.info(command)
@@ -46,7 +38,6 @@ def sort_13_gram_buckets(working_directory):
        if terminate:
            return
-        Path(done_file).touch()
        os.remove(bucket_file_path)
parser = argparse.ArgumentParser(description='sort 13gram buckets')
@@ -54,6 +45,9 @@ parser.add_argument("-dir", "--working_directory", default="")
if __name__ == '__main__':
    version = 1.00
    print(f"Running version {version}")

    # Handle sigint (ctrl-c) cleanly
    previous_signal_int = signal.signal(SIGINT, handler)