Commit c49b4644 authored by Mostofa Patwary

added more comments

parent 7c3d8b7a
@@ -26,7 +26,8 @@ python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for
```
python cleanup_dataset.py <input data file> <output cleaned data filename>
```
-2. Using LSH, find possible duplicates and store them in a file for later processing. This step can NOT be sharded and usually takes 12 to 24 hours for the OpenWebText dataset. The code supports saving and loading fingerprints for recurrent deduplication.
+Additional cleanup (e.g. removing documents shorter than 512 characters, or dataset-specific cleaning for the stories and realnews datasets) can be done using `cleanup_fix_dataset.py`. See the program arguments for details.
+2. Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplication, and is also multithreaded for faster processing. More details can be found in the program arguments. A minimal sketch of the LSH idea follows the command below.
```
python find_duplicates.py --inputs <pairlist list of input cleaned data files and keys, e.g. cc.json cc_id news.json news_id> --output <output possible duplicate urls filename>
```
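The LSH step computes a MinHash fingerprint per document and buckets the fingerprints so that only documents whose fingerprints collide are reported as possible duplicates. The following is a minimal illustrative sketch of that idea using the `datasketch` package; the library choice, keys, shingle size, and threshold are assumptions for this example, not the actual `find_duplicates.py` implementation.
```
# Minimal sketch of MinHash + LSH candidate detection. The `datasketch`
# package, the shingle size, and the similarity threshold are assumptions
# for this example, not taken from find_duplicates.py.
from datasketch import MinHash, MinHashLSH

def fingerprint(text, num_perm=128, shingle_size=5):
    # Build a MinHash fingerprint from overlapping character shingles.
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - shingle_size + 1):
        m.update(text[i:i + shingle_size].encode('utf-8'))
    return m

docs = {
    'url-1': 'the quick brown fox jumps over the lazy dog',
    'url-2': 'the quick brown fox jumped over the lazy dog',
    'url-3': 'an entirely different document about something else',
}

# Index every fingerprint; documents whose estimated Jaccard similarity
# exceeds the threshold end up in the same LSH buckets.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
fingerprints = {key: fingerprint(text) for key, text in docs.items()}
for key, m in fingerprints.items():
    lsh.insert(key, m)

# Query the index to list possible duplicates for each document.
for key, m in fingerprints.items():
    candidates = [c for c in lsh.query(m) if c != key]
    if candidates:
        print(key, '->', candidates)
```
The MinHash objects here are plain Python objects and can be pickled, which is one way to save and reload fingerprints for recurrent deduplication as mentioned above.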
@@ -51,7 +52,7 @@ To deduplicate the downstream tasks (e.g. lambada, squad) from the training data
```
python filter_ngrams.py --tasks <name of the task, e.g. lambada, squad> --dedup-dataset <training dataset to deduplicate> <json key> --output <output training dataset>
```
-We use 13-grams for the deduplication. When we find a 13-gram match in a training document, we split the document into two pieces and remove the 13-gram along with 200 characters from both sides of the 13-gram. We also remove any split piece with fewer than 200 characters, and discard any document that gets split more than 10 times.
+We use 13-grams by default for the deduplication. When we find a 13-gram match in a training document, we split the document into two pieces and remove the 13-gram along with 200 characters from both sides of the 13-gram. We also remove any split piece with fewer than 200 characters, and discard any document that gets split more than 10 times. These parameters can be changed using the corresponding arguments; a rough sketch of this logic is shown below.
For the lambada task only, we also need to provide the path to the test data: `--lambada-path <path of the lambada test data>`.
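As referenced above, here is a rough sketch of the splitting logic: find a matching n-gram, cut it out together with a fixed trim window on both sides, and drop pieces that are too short or documents that are split too often. The function names are hypothetical and the real behaviour is controlled by the `filter_ngrams.py` arguments.
```
# Illustrative n-gram decontamination sketch (not the actual
# filter_ngrams.py implementation). The defaults mirror the description
# above: 13-grams, a 200-character trim window, a 200-character minimum
# piece length, and at most 10 splits per document.
def ngrams(text, n=13):
    # Word-level n-grams of `text`, joined back into plain strings.
    words = text.split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

def split_out_ngram(document, ngram, trim_chars=200):
    # Remove every occurrence of `ngram` plus `trim_chars` characters on
    # both sides, returning the remaining pieces of the document.
    pieces, start = [], 0
    while True:
        idx = document.find(ngram, start)
        if idx == -1:
            pieces.append(document[start:])
            return pieces
        pieces.append(document[start:max(start, idx - trim_chars)])
        start = idx + len(ngram) + trim_chars

def decontaminate(document, task_ngrams, min_chars=200, max_splits=10):
    # Return the surviving pieces of `document`, or [] if it was split
    # too many times or nothing long enough remains.
    pieces, num_splits = [document], 0
    for ngram in task_ngrams:
        next_pieces = []
        for piece in pieces:
            if ngram in piece:
                num_splits += 1
                if num_splits > max_splits:
                    return []
                next_pieces.extend(split_out_ngram(piece, ngram))
            else:
                next_pieces.append(piece)
        pieces = next_pieces
    return [p for p in pieces if len(p) >= min_chars]
```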
......
@@ -3,6 +3,10 @@ import json
import os
import time
"""
This code adds an id to each json object in a json file. The user can add
a prefix to the ids.
"""
if __name__ == '__main__':
......
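For context on the `add_id` docstring above, the sketch below shows the kind of transformation it describes: reading a loose-json file line by line and attaching an (optionally prefixed) id to each object. The field name, id format, and argument handling are assumptions for illustration, not the actual script body.
```
# Illustrative sketch only: attach an id to every json object in a
# loose-json file (one object per line). The field name, id format, and
# argument handling are assumptions, not copied from add_id.py.
import json

def add_ids(input_file, output_file, prefix='doc'):
    with open(input_file, 'r') as fin, open(output_file, 'w') as fout:
        for line_num, line in enumerate(fin):
            myjson = json.loads(line)
            # Optionally prefixed, zero-padded id for this document.
            myjson['id'] = '{}-{:010d}'.format(prefix, line_num)
            fout.write(json.dumps(myjson) + '\n')

if __name__ == '__main__':
    add_ids('input.json', 'output.json', prefix='openwebtext')
```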
@@ -31,6 +31,12 @@ from pathlib import Path
import re
import time
"""
This code does additional cleanup, for example removing documents shorter
than 512 characters, or dataset-specific cleaning for datasets like stories
and realnews. See the program arguments for details.
"""
def process_doc(json_line, args):
# Read the line.
......
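To make the per-document cleanup concrete, here is a minimal sketch of what a filter along the lines of `process_doc` could do; the signature, argument names, and checks are assumptions rather than the real `cleanup_fix_dataset.py` logic.
```
# Illustrative per-document cleanup sketch (not the real process_doc,
# whose signature takes parsed program args). Dataset-specific rules for
# e.g. stories or realnews would be added alongside the length check.
import json

def process_doc(json_line, min_chars=512, text_key='text'):
    # Return the parsed document if it passes the filters, else None.
    try:
        doc = json.loads(json_line)
    except ValueError:
        return None  # skip malformed lines
    text = doc.get(text_key, '')
    if len(text) < min_chars:
        return None  # too short to keep
    return doc
```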