janitor.py contains a script to remove benchmark data contamination from training data sets.
It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.14165).

## Algorithm
1) Collects all contamination text files that are to be removed from training data
2) Filters training data by finding `N`gram matches between the training data
   and any contamination
   1) `N`grams ignore case and punctuation and are split on whitespace.
   2) Matching `N`gram substrings are removed, as is a `window_to_remove` character window around
      the match, splitting the training data into chunks
   3) Any chunks less than `minimum_slice_length` are removed
   4) Training data sets split into more than `too_dirty_cutoff` chunks are considered
      completely contaminated and removed
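
The n-gram normalization in step 2 can be sketched in plain Python. This is an illustrative sketch only (the function name `form_ngrams` is hypothetical, and the real janitor implementation may normalize differently):

```python
import string

def form_ngrams(text, n):
    """Lowercase, strip punctuation, split on whitespace, and yield n-grams.

    Hypothetical sketch of the normalization described above; not the
    actual janitor code.
    """
    # Drop punctuation and lowercase before splitting on whitespace
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i : i + n])

# Example: 2-grams of a short sentence
print(list(form_ngrams("Hello, world! Hello again.", 2)))
# → ['hello world', 'world hello', 'hello again']
```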

OpenAI used:
```
ngram_n = 13
window_to_remove = 200
minimum_slice_length = 200
too_dirty_cutoff = 10
```
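
Steps 2–4 can be sketched as follows. This is a hypothetical illustration (the function name `split_and_filter` and the `(start, end)` span representation are assumptions, not janitor's actual API), showing how the three parameters interact:

```python
def split_and_filter(text, match_spans, window_to_remove=200,
                     minimum_slice_length=200, too_dirty_cutoff=10):
    """Hypothetical sketch: remove each matched span plus a character window
    around it, keep only the chunks that are long enough, and drop documents
    that split into too many pieces."""
    chunks, pos = [], 0
    for start, end in sorted(match_spans):
        # Cut from window_to_remove chars before the match ...
        cut_start = max(pos, start - window_to_remove)
        if cut_start > pos:
            chunks.append(text[pos:cut_start])
        # ... to window_to_remove chars after it
        pos = min(len(text), end + window_to_remove)
    chunks.append(text[pos:])
    # Discard slices shorter than minimum_slice_length
    chunks = [c for c in chunks if len(c) >= minimum_slice_length]
    # A document shattered into too many pieces is considered fully contaminated
    if len(chunks) > too_dirty_cutoff:
        return []
    return chunks
```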

## Compiling

Janitor can be used as a pure Python program, but it is much faster if the ngram
code is run in C++. To compile the C++ code, run

```
pip install pybind11
c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)
```

If your compiler isn't linked to Python, you may need to add `-undefined dynamic_lookup` to the command above.
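
Since the C++ extension is optional, one way to use it from Python is a guarded import that falls back to the pure-Python path when the module hasn't been compiled (the flag name `HAVE_CPP_NGRAMS` here is illustrative):

```python
# Prefer the compiled pybind11 extension; fall back to pure Python if the
# shared object built above is not present on the import path.
try:
    import janitor_util  # the C++ module compiled with the command above
    HAVE_CPP_NGRAMS = True
except ImportError:
    HAVE_CPP_NGRAMS = False

print("using C++ ngrams:", HAVE_CPP_NGRAMS)
```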