README.md 1.29 KB
Newer Older
bzantium's avatar
bzantium committed
1
janitor.py contains a script to remove benchmark data contamination from training data sets.
2
3
4
5
It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.14165).

## Algorithm
1) Collects all contamination text files that are to be removed from training data
bzantium's avatar
bzantium committed
6
2) Filters training data by finding `N`gram matches between the training data
7
   and any contamination
bzantium's avatar
bzantium committed
8
9
   1) `N`grams ignore case and punctuation and are split on whitespace.
   2) Matching `N`gram substrings are removed, as is a `window_to_remove` character window around
10
11
12
13
    the match, splitting the training data into chunks
   3) Any chunks less than `minimum_slice_length` are removed
   4) Training data sets split into more than `too_dirty_cutoff` are considered
    completey contaminated and removed
bzantium's avatar
bzantium committed
14

15
16
17
18
19
20
21
22
OpenAI used:
```
ngram_n = 13
window_to_remove = 200
minimum_slice_length = 200
too_dirty_cutoff = 10
```

bzantium's avatar
bzantium committed
23
## Compiling
24
25
26
27
28
29
30
31
32
33

Janitor can be used as a pure python program, but it is much faster if the ngram
code is run in C++. To compile the C++ code, run

```
pip install pybind11
c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)
```

If your your compiler isn't linked to python, you may need to add to the above `-undefined dynamic_lookup`