Commit b08b5edc authored by Mostofa Patwary

Updated readme

parent a29890c3
@@ -46,10 +46,11 @@ shuf <cleaned deduped data file> -o train_data.json
# Deduplicating ngrams
To deduplicate the downstream tasks (e.g. lambada, squad) from the training dataset, we run the following command.
```
python filter_ngrams.py --tasks <name of the task, e.g. lambada, squad> --dedup-dataset <training dataset to deduplicate> <json key> --output <output training dataset>
```
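For example, a hypothetical invocation deduplicating the squad task from `train_data.json` (assuming the documents store their text under a `text` JSON key, and an output file name chosen for illustration) might look like:
```
python filter_ngrams.py --tasks squad --dedup-dataset train_data.json text --output train_data_dedup.json
```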
We use 13-grams for the deduplication. When we find a 13-gram match in a training document, we split the document into two pieces and remove the 13-gram along with 200 characters from both sides of it. We also remove any split piece with fewer than 200 characters, and any document that was split more than 10 times.
For the lambada task only, we also need to provide the path to the test data via `--lambada-path <path of the lambada test data>`.
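The splitting rule can be sketched as follows. This is a minimal illustration, not the actual `filter_ngrams.py` implementation; `match_spans`, the (start, end) character offsets of the 13-gram hits in a document, is a hypothetical input.
```
# Minimal sketch of the splitting rule described above (not the actual
# filter_ngrams.py implementation). `match_spans` is assumed to hold the
# (start, end) character offsets of 13-gram hits found in the document.
def split_on_matches(text, match_spans, margin=200, min_len=200, max_splits=10):
    if len(match_spans) > max_splits:
        return []  # document was split too many times; drop it entirely
    pieces, cursor = [], 0
    for start, end in sorted(match_spans):
        # keep the text up to `margin` characters before the match
        pieces.append(text[cursor:max(cursor, start - margin)])
        cursor = end + margin  # resume `margin` characters after the match
    pieces.append(text[cursor:])
    # drop any resulting piece shorter than `min_len` characters
    return [p for p in pieces if len(p) >= min_len]
```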