# Clean Training Data

`janitor.py` contains a script to remove benchmark data contamination from training data sets.
It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.14165).

## Algorithm

1) Collects all contamination text files that are to be removed from training data
2) Filters training data by finding `N`gram matches between the training data
   and any contamination
   1) `N`grams ignore case and punctuation and are split on whitespace.
   2) Matching `N`gram substrings are removed, as is a `window_to_remove` character window around
      the match, splitting the training data into chunks
   3) Any chunks shorter than `minimum_slice_length` are removed
   4) Training data sets split into more than `too_dirty_cutoff` chunks are considered
      completely contaminated and removed
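
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `janitor.py`: the helper names `tokens_with_spans`, `ngram_set`, and `clean` are hypothetical, and the sketch assumes that the `too_dirty_cutoff` check counts chunks before short chunks are dropped, which the description above does not pin down.

```python
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def tokens_with_spans(text):
    """Return (normalized_token, start, end) for each whitespace-separated token."""
    out = []
    for match in re.finditer(r"\S+", text):
        token = match.group().lower().translate(_PUNCT)  # ignore case and punctuation
        if token:  # skip tokens that consisted only of punctuation
            out.append((token, match.start(), match.end()))
    return out


def ngram_set(text, n):
    """Return the set of normalized word n-grams in `text`."""
    words = [tok for tok, _, _ in tokens_with_spans(text)]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean(document, contamination, ngram_n=13, window_to_remove=200,
          minimum_slice_length=200, too_dirty_cutoff=10):
    """Return the clean chunks of `document`, or [] if it is too contaminated."""
    dirty = set()
    for text in contamination:  # step 1: collect contamination n-grams
        dirty |= ngram_set(text, ngram_n)

    toks = tokens_with_spans(document)
    cuts = []  # character spans to delete, sorted and non-overlapping
    for i in range(len(toks) - ngram_n + 1):
        if tuple(tok for tok, _, _ in toks[i:i + ngram_n]) not in dirty:
            continue
        # Step 2.2: remove the match plus a window on either side.
        start = max(0, toks[i][1] - window_to_remove)
        end = min(len(document), toks[i + ngram_n - 1][2] + window_to_remove)
        if cuts and start <= cuts[-1][1]:
            cuts[-1] = (cuts[-1][0], max(cuts[-1][1], end))  # merge overlapping cuts
        else:
            cuts.append((start, end))

    chunks, pos = [], 0
    for start, end in cuts:
        chunks.append(document[pos:start])
        pos = end
    chunks.append(document[pos:])

    if len(chunks) > too_dirty_cutoff:  # step 2.4 (assumed to run before step 2.3)
        return []
    return [c for c in chunks if len(c) >= minimum_slice_length]  # step 2.3


# Tiny parameters so the effect is visible on a one-line document.
doc = "aaaa bbbb cccc The quick, brown fox! dddd eeee ffff"
print(clean(doc, ["the quick brown fox"], ngram_n=3,
            window_to_remove=2, minimum_slice_length=5))
```

Normalizing token by token keeps the character offsets of each word, so the removal window can be applied to the original text rather than to the normalized copy.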

OpenAI used:

```text
ngram_n = 13
window_to_remove = 200
minimum_slice_length = 200
too_dirty_cutoff = 10
```

## Compiling

Janitor can be used as a pure Python program, but it is much faster if the ngram
code is run in C++. To compile the C++ code, run:

```bash
pip install pybind11
c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)
```

macOS users: If your compiler isn't linked to Python, you may need to add `-undefined dynamic_lookup` to the command above. \
Linux users: If your compiler isn't linked to Python, you may need to follow these steps:

1. Rename the compiled code file to `janitor_util.so`.
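2. Before running `import Janitor` in your code, add the directory that contains `janitor_util.so` to the module search path, e.g. `sys.path.append("your/relative/path/to/dir")`, so that Python can locate `janitor_util.so`. Entries in `sys.path` are directories, not paths to individual files.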
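
A quick way to check that Python can see the compiled module before importing anything that depends on it is sketched below. The helper name `extension_available` and its `search_dir` parameter are hypothetical and not part of Janitor. The sketch appends the directory holding the file, since `sys.path` entries are directories.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def extension_available(module_name="janitor_util", search_dir=None):
    """Return True if `module_name` can be found on the import path.

    `search_dir` (optional) is the directory containing the compiled file;
    it is appended to sys.path before the lookup.
    """
    if search_dir is not None:
        directory = str(Path(search_dir).resolve())
        if directory not in sys.path:
            sys.path.append(directory)
        importlib.invalidate_caches()  # pick up files created after startup
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "." is a placeholder; use the directory where janitor_util.so was built.
    if extension_available("janitor_util", search_dir="."):
        print("janitor_util found")
    else:
        print("janitor_util not found on sys.path")
```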