# Decontamination

## Usage

The provided directory should contain the ngram files and `info.json` produced in "Pile Ngram Generation" further down.

```bash
python -m lm_eval \
    --model gpt2 \
    --device 0 \
    --tasks sciq
```

## Background

Downstream evaluations test model generalization, and they are less useful when test set data also exists in the training set, which is referred to as leakage or contamination.

Filtering your training set against the test set is a good first step; however, this isn't always possible, as in the case of a new benchmark or one that wasn't considered prior to model training. When training set filtering isn't possible, it is useful to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.

The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on dataset, while we just used 13 for simplicity.
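
The contamination criterion above can be sketched in a few lines. This is an illustrative version assuming simple whitespace tokenization; the harness's actual tokenization and matching logic differ:

```python
def has_ngram_overlap(test_doc: str, train_doc: str, n: int = 13) -> bool:
    # A test document is contaminated if any n-gram of it appears
    # verbatim in the training document.
    train_tokens = train_doc.split()
    train_ngrams = {
        tuple(train_tokens[i:i + n]) for i in range(len(train_tokens) - n + 1)
    }
    test_tokens = test_doc.split()
    return any(
        tuple(test_tokens[i:i + n]) in train_ngrams
        for i in range(len(test_tokens) - n + 1)
    )
```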

## Implementation

Contamination detection can be found in `lm_eval/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.

`decontaminate.py` does the following:

1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
2. Scan through sorted files containing training set n-grams.
3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
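
The three steps above can be sketched as follows. The `(task, doc_id)` keying and whitespace tokenization here are simplifications for illustration, not the harness's actual implementation:

```python
from collections import defaultdict

def build_ngram_index(docs, n=13):
    """Step 1: map each n-gram to the (task, doc_id) pairs it appears in.

    `docs` is assumed to be an iterable of (task, doc_id, text) tuples.
    """
    index = defaultdict(set)
    for task, doc_id, text in docs:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            index[" ".join(tokens[i:i + n])].add((task, doc_id))
    return index

def find_contaminated(index, training_ngrams):
    """Steps 2-3: scan training set n-grams (e.g. read sequentially from the
    sorted bucket files) and mark any matching evaluation documents."""
    contaminated = set()
    for ngram in training_ngrams:
        contaminated |= index.get(ngram, set())
    return contaminated
```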

`lm_eval/evaluator.py` can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.

This is disabled by default for new tasks; to support decontamination on a task, override the `should_decontaminate` and `doc_to_decontamination_query` methods. For more details see the [task guide](task_guide.md).
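
A minimal sketch of such an override follows. The `Task` stub and the `doc["text"]` field are assumptions for this example; the real base class lives in the harness:

```python
class Task:
    # Minimal stand-in for the harness's Task base class, for illustration only.
    def should_decontaminate(self):
        return False

class MyTask(Task):
    def should_decontaminate(self):
        # Opt this task into contamination detection.
        return True

    def doc_to_decontamination_query(self, doc):
        # Return the portion of the document to check against training ngrams.
        return doc["text"]
```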

## Pile Ngram Generation

The relevant scripts can be found in `scripts/clean_training_data`, which also import from
`lm_eval/decontamination/`.

1. `git clone https://github.com/EleutherAI/lm-evaluation-harness.git`
2. `pip install -r requirements.txt`
3. Download The Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
4. Place the pile files in a "pile" directory under "lm-evaluation-harness" (or create a symlink)
5. Run `generate_13_grams`.

```bash
export PYTHONHASHSEED=0
python -m scripts.clean_training_data.generate_13_grams \
       -dir path/to/working/directory \
       -n 13 \
       -buckets 500
```

This took approximately 4 days for us. We had the time to wait, but this could be scaled out by doing partial pile scans on multiple instances of this script and merging the relevant buckets. We fixed `PYTHONHASHSEED` to ensure reproducibility of bucket hashing in case you need to stop and restart.
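
Bucketing can be sketched as hashing each 13-gram into one of `-buckets` files. With `PYTHONHASHSEED` fixed, `hash()` of a string is stable across Python processes, so a restarted run assigns each ngram to the same bucket. The function and file names here are illustrative, not the script's actual ones:

```python
def bucket_for(ngram: str, num_buckets: int = 500) -> int:
    # Python string hashing is randomized per process unless PYTHONHASHSEED
    # is fixed, which is why the run above exports PYTHONHASHSEED=0.
    return hash(ngram) % num_buckets

def bucket_filename(ngram: str, num_buckets: int = 500) -> str:
    # Illustrative naming scheme for the on-disk bucket files.
    return f"13_grams_{bucket_for(ngram, num_buckets)}.bkt"
```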

6. Sort the generated 13-grams.

```bash
python -m scripts.clean_training_data.sort_13_gram_buckets \
       -dir path/to/working/directory/output
```

This took approximately 5 days for us. You could speed this up by spreading the files across different machines and running the sort script before gathering them back together.
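
Because each bucket is far larger than memory, the sort is effectively an external sort: sort fixed-size chunks to temporary files, then k-way merge them. A generic sketch of that pattern (the chunk size and temp-file handling are illustrative, not the script's actual approach):

```python
import heapq
import os
import tempfile

def _flush(sorted_lines):
    # Write one sorted chunk to a temporary file and return its path.
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted_lines)
    return path

def external_sort(lines, chunk_size=100_000):
    """Yield newline-terminated lines in sorted order without holding
    the full input in memory."""
    chunks, buf = [], []
    for line in lines:
        buf.append(line)
        if len(buf) >= chunk_size:
            chunks.append(_flush(sorted(buf)))
            buf = []
    if buf:
        chunks.append(_flush(sorted(buf)))
    files = [open(p) for p in chunks]
    try:
        # heapq.merge lazily merges the already-sorted chunk files.
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
        for p in chunks:
            os.remove(p)
```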

7. Compress the sorted 13-gram files and place them together with `info.json`.

This step only takes a few hours.

```bash
python -m scripts.clean_training_data.compress_and_package \
       -dir path/to/working/directory \
       -output path/to/final/directory \
       -procs 8
```