"vscode:/vscode.git/clone" did not exist on "01e355516bbeecf1588a24c75321895029ea1123"
Commit 0fa728ac authored by Mostofa Patwary

addressed reviews

parent f938e19a
...@@ -26,8 +26,8 @@ python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for
```
python cleanup_dataset.py <input data file> <output cleaned data filename>
```
Additional cleanup (e.g. removing documents shorter than 512 characters, or dataset-specific cleaning for the stories and realnews datasets) can be done using `cleanup_fix_dataset.py`. More details can be found by running `python cleanup_fix_dataset.py --help`.
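As a rough sketch of the kind of length filter described above (assuming JSON-lines input with a `text` field; this is an illustration, not the repository's actual implementation):

```python
import json

def keep_doc(json_line, min_chars=512):
    # Parse one JSON-lines record and keep it only if its text is long
    # enough. The 'text' field name is an assumption for illustration.
    doc = json.loads(json_line)
    return len(doc.get('text', '')) >= min_chars

# A short document is dropped, a long one is kept.
lines = [json.dumps({'text': 'too short'}),
         json.dumps({'text': 'x' * 600})]
kept = [line for line in lines if keep_doc(line)]
```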
2. Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplications, and is also multithreaded for faster processing. More details can be found by running `python find_duplicates.py --help`.
```
python find_duplicates.py --inputs <pair list of input cleaned data files and keys, e.g. cc.json cc_id news.json news_id> --output <output possible duplicate urls filename>
```
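The LSH step above rests on MinHash fingerprints: documents with similar shingle sets get similar signatures, so candidate duplicates can be found by comparing signatures instead of full texts. A toy, self-contained sketch of that idea (not `find_duplicates.py` itself, which is multithreaded and persists its fingerprints):

```python
import hashlib

def shingles(text, n=5):
    # Character n-gram shingles of a document.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(text, num_perm=64):
    # For each of num_perm seeded hash functions, keep the minimum hash
    # over the document's shingles. The fraction of matching signature
    # positions estimates the Jaccard similarity of the shingle sets.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5((str(seed) + s).encode()).hexdigest(), 16)
            for s in shingles(text)))
    return sig

def similarity(sig_a, sig_b):
    # Estimated Jaccard similarity from two MinHash signatures.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate documents score far higher than unrelated ones, which is what lets LSH bucket likely duplicates cheaply.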
...@@ -56,4 +56,4 @@ We use 13-grams by default for the deduplication. When we find a 13-gram match i
Only for the lambada task, we need to provide the path, `--lambada-path <path of the lambada test data>`.
Several other features (e.g. save and load dictionary) have been added; look at `python filter_ngrams.py --help` for details.
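The 13-gram matching described above can be sketched as a set intersection over word n-grams (a toy illustration of the core check only; `filter_ngrams.py` has many more options, per its `--help`):

```python
def ngrams(words, n=13):
    # All contiguous word n-grams of a token list, as hashable tuples.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_ngram_overlap(train_text, eval_text, n=13):
    # True if any word n-gram of the evaluation text also occurs in the
    # training document; such documents are candidates for filtering.
    # A sketch of the matching idea, not filter_ngrams.py itself.
    return bool(ngrams(train_text.split(), n) & ngrams(eval_text.split(), n))
```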
...@@ -16,7 +16,9 @@
"""
Filter and clean documents:
Capable of cleaning docs with less than 512 characters, docs with less
than 256 characters that contain javascript, fixing text, and dataset
specific cleaning like the stories and realnews datasets.
Program arguments have the details.
"""
import argparse
...@@ -31,12 +33,6 @@ from pathlib import Path
import re
import time
def process_doc(json_line, args):
    # Read the line.
...@@ -164,9 +160,14 @@ if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-files', nargs='*', required=True, default=\
                        None, help='Input json files that need to be'\
                        ' cleaned')
    parser.add_argument('--tasks', nargs='*', required=True, default=None,\
                        help='Tasks to perform on the input files, ' \
                        'such as remove_512, remove_256_javascript, ' \
                        'remove_512_non_english, ftfy_fix_text, and ' \
                        'general_cleaning. 256 or 512 means the number' \
                        ' of characters.')
    parser.add_argument('--output-path', type=str, default=None,
                        help='Directory where the output should go')
    parser.add_argument('--log-interval', type=int, default=100,
...