chenpangpang / transformers
Commit db034660 (Unverified)
Authored May 04, 2022 by Thomas Wang; committed by GitHub, May 04, 2022

Fix hashing for deduplication (#17048)

parent 39f8eafc

Showing 1 changed file with 2 additions and 1 deletion

examples/research_projects/codeparrot/scripts/preprocessing.py (+2 -1)

 import gzip
+import hashlib
 import multiprocessing
 import os
 import shutil
...
@@ -13,7 +14,7 @@ from transformers import HfArgumentParser
 def get_hash(example):
     """Get hash of content field."""
-    return {"hash": hash(example["content"])}
+    return {"hash": hashlib.md5(example["content"].strip().encode("utf-8")).hexdigest()}
 def line_stats(example):
...
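
The commit message does not say why the hash was changed. One relevant fact is that Python's built-in hash() for str values is salted per interpreter process unless PYTHONHASHSEED is fixed, so its values are not reproducible across runs, whereas an MD5 hex digest depends only on the bytes hashed. The new version also strips the content first, so copies that differ only in leading or trailing whitespace hash identically. The following is a minimal sketch of how such a hash could drive exact deduplication. get_hash mirrors the changed line; the deduplicate helper and the sample records are hypothetical and are not taken from preprocessing.py.

```python
import hashlib


def get_hash(example):
    """Return a stable MD5 hex digest of the stripped content field."""
    content = example["content"].strip().encode("utf-8")
    return {"hash": hashlib.md5(content).hexdigest()}


def deduplicate(examples):
    """Keep the first example seen for each distinct content hash.

    Hypothetical helper for illustration only; not part of the commit.
    """
    seen = set()
    unique = []
    for example in examples:
        digest = get_hash(example)["hash"]
        if digest not in seen:
            seen.add(digest)
            unique.append(example)
    return unique


# Made-up sample records: the first two differ only in surrounding whitespace.
examples = [
    {"content": "print('hello')\n"},
    {"content": "  print('hello')"},
    {"content": "print('world')"},
]
unique = deduplicate(examples)
print(len(unique), "unique examples out of", len(examples))
```

Because the digest is a plain hex string, it can be stored alongside each record and compared across separate runs or worker processes, which a per-process salted hash() value does not reliably allow.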