"vscode:/vscode.git/clone" did not exist on "01e355516bbeecf1588a24c75321895029ea1123"
Commit 0fa728ac authored by Mostofa Patwary

addressed reviews

parent f938e19a
...@@ -26,8 +26,8 @@ python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for
```
python cleanup_dataset.py <input data file> <output cleaned data filename>
```
Additional cleanup (e.g. removing documents shorter than 512 characters, or dataset-specific cleaning for the stories and realnews datasets) can be done using `cleanup_fix_dataset.py`. More details can be found by running `python cleanup_fix_dataset.py --help`.
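As a rough sketch of the kind of length filter described above (assuming JSON-lines input with a `text` field; this is an illustration, not the repository's actual implementation):

```python
import json

def keep_doc(json_line, min_chars=512):
    # Parse one JSON-lines record and keep it only if its text is long
    # enough. The 'text' field name is an assumption for illustration.
    doc = json.loads(json_line)
    return len(doc.get('text', '')) >= min_chars

# A short document is dropped, a long one is kept.
lines = [json.dumps({'text': 'too short'}),
         json.dumps({'text': 'x' * 600})]
kept = [line for line in lines if keep_doc(line)]
```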
2. Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplications, and is also multithreaded for faster processing. More details can be found by running `python find_duplicates.py --help`.
```
python find_duplicates.py --inputs <pair list of input cleaned data files and keys, e.g. cc.json cc_id news.json news_id> --output <output possible duplicate urls filename>
```
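The LSH step above rests on MinHash fingerprints: documents with similar shingle sets get similar signatures, so candidate duplicates can be found by comparing signatures instead of full texts. A toy, self-contained sketch of that idea (not `find_duplicates.py` itself, which is multithreaded and persists its fingerprints):

```python
import hashlib

def shingles(text, n=5):
    # Character n-gram shingles of a document.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(text, num_perm=64):
    # For each of num_perm seeded hash functions, keep the minimum hash
    # over the document's shingles. The fraction of matching signature
    # positions estimates the Jaccard similarity of the shingle sets.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5((str(seed) + s).encode()).hexdigest(), 16)
            for s in shingles(text)))
    return sig

def similarity(sig_a, sig_b):
    # Estimated Jaccard similarity from two MinHash signatures.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate documents score far higher than unrelated ones, which is what lets LSH bucket likely duplicates cheaply.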
...@@ -56,4 +56,4 @@ We use 13-grams by default for the deduplication. When we find a 13-gram match i
Only for the lambada task, we need to provide the path, `--lambada-path <path of the lambada test data>`.
Several other features (e.g. save and load dictionary) have been added; look at `python filter_ngrams.py --help` for details.
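The 13-gram matching described above can be sketched as a set intersection over word n-grams (a toy illustration of the core check only; `filter_ngrams.py` has many more options, per its `--help`):

```python
def ngrams(words, n=13):
    # All contiguous word n-grams of a token list, as hashable tuples.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_ngram_overlap(train_text, eval_text, n=13):
    # True if any word n-gram of the evaluation text also occurs in the
    # training document; such documents are candidates for filtering.
    # A sketch of the matching idea, not filter_ngrams.py itself.
    return bool(ngrams(train_text.split(), n) & ngrams(eval_text.split(), n))
```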
...@@ -16,7 +16,9 @@
"""
Filter and clean documents:
Capable of cleaning docs with less than 512 characters, docs with less
than 256 characters that contain javascript, fixing text, and dataset
specific cleaning like the stories and realnews datasets.
Program arguments have the details.
"""
import argparse
...@@ -31,12 +33,6 @@ from pathlib import Path
import re
import time
def process_doc(json_line, args):
    # Read the line.
...@@ -164,9 +160,14 @@ if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-files', nargs='*', required=True, default=\
                        None, help='Input json files that need to be'\
                        ' cleaned')
    parser.add_argument('--tasks', nargs='*', required=True, default=None,\
                        help='Tasks to perform on the input files, ' \
                        'such as remove_512, remove_256_javascript, ' \
                        'remove_512_non_english, ftfy_fix_text, and ' \
                        'general_cleaning. 256 or 512 means the number' \
                        ' of characters.')
    parser.add_argument('--output-path', type=str, default=None,
                        help='Directory where the output should go')
    parser.add_argument('--log-interval', type=int, default=100,
...