Commit b08b5edc authored by Mostofa Patwary

Updated readme

parent a29890c3
@@ -46,10 +46,11 @@ shuf <cleaned deduped data file> -o train_data.json
# Deduplicating ngrams
To deduplicate the downstream tasks (e.g. lambada, squad) from the training dataset, we run the following command.
```
python filter_ngrams.py --tasks <name of the task, e.g. lambada, squad> --dedup-dataset <training dataset to deduplicate> <json key> --output <output training dataset>
```
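For example, a hypothetical invocation deduplicating the squad task from `train_data.json` (assuming the documents store their text under a `text` JSON key, and an output file name chosen for illustration) might look like:
```
python filter_ngrams.py --tasks squad --dedup-dataset train_data.json text --output train_data_dedup.json
```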
We use 13-grams for the deduplication. When we find a 13-gram match in a training document, we split the document into two pieces and remove the 13-gram along with 200 characters from both sides of it. We also remove any split piece with fewer than 200 characters, and any document that was split more than 10 times.
For the lambada task only, we also need to provide the path to the test data via `--lambada-path <path of the lambada test data>`.
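The splitting rule can be sketched as follows. This is a minimal illustration, not the actual `filter_ngrams.py` implementation; `match_spans`, the (start, end) character offsets of the 13-gram hits in a document, is a hypothetical input.
```
# Minimal sketch of the splitting rule described above (not the actual
# filter_ngrams.py implementation). `match_spans` is assumed to hold the
# (start, end) character offsets of 13-gram hits found in the document.
def split_on_matches(text, match_spans, margin=200, min_len=200, max_splits=10):
    if len(match_spans) > max_splits:
        return []  # document was split too many times; drop it entirely
    pieces, cursor = [], 0
    for start, end in sorted(match_spans):
        # keep the text up to `margin` characters before the match
        pieces.append(text[cursor:max(cursor, start - margin)])
        cursor = end + margin  # resume `margin` characters after the match
    pieces.append(text[cursor:])
    # drop any resulting piece shorter than `min_len` characters
    return [p for p in pieces if len(p) >= min_len]
```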