Unverified Commit 286a18fa authored by Loubna Ben Allal, committed by GitHub

Fix codeparrot deduplication - ignore whitespaces (#18023)

* ignore whitespaces for hash

* reformat code

* Update README.md
parent 5d1fed07
@@ -39,7 +39,7 @@ The source of the dataset is the GitHub dump available on Google's [BigQuery](ht

### Preprocessing
The raw dataset contains many duplicates. We deduplicated and filtered the dataset using the heuristics proposed in OpenAI's Codex [paper](https://arxiv.org/abs/2107.03374) and some new ones:
- - exact deduplication using each file's hash
+ - exact deduplication using each file's hash after having removed whitespaces.
- near deduplication using MinHash and Jaccard similarity. MinHash with a Jaccard threshold (default=0.85) is first used to create duplicate clusters. These clusters are then reduced to unique files based on the exact Jaccard similarity. See `deduplicate_dataset` in `minhash_deduplication.py` for a detailed description.
- filtering files with max line length > 1000
- filtering files with mean line length > 100
...
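The near-deduplication step described in the README is implemented in `minhash_deduplication.py`. As a rough illustration of the idea only (not the project's actual code), a minimal sketch using the `datasketch` library might look like the following; the word-level tokenizer, the helper names, and the 0.85 threshold default are assumptions for the example:

```python
# Sketch of MinHash-based near-deduplication using `datasketch`.
# Illustrative only; not the repository's minhash_deduplication.py.
import re
from datasketch import MinHash, MinHashLSH

TOKEN_PATTERN = re.compile(r"\W+")  # hypothetical tokenizer: split on non-word characters


def minhash_of(content: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a file's set of tokens."""
    m = MinHash(num_perm=num_perm)
    for token in set(TOKEN_PATTERN.split(content)):
        if token:
            m.update(token.encode("utf-8"))
    return m


def near_duplicate_clusters(files: dict, threshold: float = 0.85):
    """Group files whose estimated Jaccard similarity exceeds `threshold`."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    signatures = {name: minhash_of(text) for name, text in files.items()}
    for name, sig in signatures.items():
        lsh.insert(name, sig)

    clusters, seen = [], set()
    for name, sig in signatures.items():
        if name in seen:
            continue
        cluster = set(lsh.query(sig))  # candidate near-duplicates, including the file itself
        seen |= cluster
        clusters.append(cluster)
    return clusters
```

As the README notes, the real script then reduces each candidate cluster to unique files using the exact Jaccard similarity; see `deduplicate_dataset` for the details.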
@@ -3,6 +3,7 @@ import hashlib
import json
import multiprocessing
import os
+ import re
import shutil
import time
from pathlib import Path
@@ -15,9 +16,12 @@ from minhash_deduplication import deduplicate_dataset
from transformers import AutoTokenizer, HfArgumentParser

+ PATTERN = re.compile(r"\s+")

def get_hash(example):
    """Get hash of content field."""
-    return {"hash": hashlib.md5(example["content"].strip().encode("utf-8")).hexdigest()}
+    return {"hash": hashlib.md5(re.sub(PATTERN, "", example["content"]).encode("utf-8")).hexdigest()}

def line_stats(example):
...
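With whitespace stripped before hashing, two files that differ only in formatting now map to the same MD5 and are treated as exact duplicates. A minimal usage sketch is below; the toy dataset and the `drop_exact_duplicates` helper are hypothetical and only illustrate how the changed `get_hash` could be applied, not how the preprocessing script itself is structured:

```python
# Illustrative only: apply the whitespace-insensitive hash to a toy dataset
# and keep the first occurrence of each hash value.
import hashlib
import re

from datasets import Dataset

PATTERN = re.compile(r"\s+")


def get_hash(example):
    """Get hash of content field, ignoring all whitespace."""
    return {"hash": hashlib.md5(re.sub(PATTERN, "", example["content"]).encode("utf-8")).hexdigest()}


def drop_exact_duplicates(ds: Dataset) -> Dataset:
    """Hypothetical helper: keep only the first example for each hash."""
    ds = ds.map(get_hash)
    seen = set()

    def first_occurrence(example):
        if example["hash"] in seen:
            return False
        seen.add(example["hash"])
        return True

    return ds.filter(first_occurrence)


# Two files identical up to formatting collapse to a single example.
toy = Dataset.from_dict({"content": ["def f():\n    return 1\n", "def f():  return 1"]})
print(len(drop_exact_duplicates(toy)))  # 1
```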