chenpangpang / transformers
Commit 7f2c384c, authored Aug 28, 2019 by VictorSanh
add `scripts/token_counts.py`
parent 4d16b279
Showing 1 changed file with 30 additions and 0 deletions
examples/distillation/scripts/token_counts.py (new file, 0 → 100644) @ 7f2c384c
from collections import Counter
import argparse
import pickle

from utils import logger

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Token Counts for smoothing the masking probabilities in MLM (cf XLM/word2vec)")
    parser.add_argument("--data_file", type=str, default="data/dump.bert-base-uncased.pickle",
                        help="The binarized dataset.")
    parser.add_argument("--token_counts_dump", type=str, default="data/token_counts.bert-base-uncased.pickle",
                        help="The dump file.")
    parser.add_argument("--vocab_size", default=30522, type=int)
    args = parser.parse_args()

    # Load the binarized dataset: an iterable of token-id sequences.
    logger.info(f'Loading data from {args.data_file}')
    with open(args.data_file, 'rb') as fp:
        data = pickle.load(fp)

    # Count how often each token id occurs across all sequences.
    logger.info('Counting occurrences for MLM.')
    counter = Counter()
    for tk_ids in data:
        counter.update(tk_ids)
    # Dense list indexed by token id; ids that never occur keep a count of 0.
    counts = [0]*args.vocab_size
    for k, v in counter.items():
        counts[k] = v

    logger.info(f'Dump to {args.token_counts_dump}')
    with open(args.token_counts_dump, 'wb') as handle:
        pickle.dump(counts, handle, protocol=pickle.HIGHEST_PROTOCOL)
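
The script above only produces the counts; this commit does not show how they are consumed. As a rough sketch of what "smoothing the masking probabilities" could look like, the example below repeats the counting step on toy data and then weights each token id by `count ** -alpha`, so that rare tokens receive a larger weight. The helper names `count_tokens` and `smoothed_mask_weights`, the exponent `alpha=0.7`, and the clamping of zero counts to 1 are all assumptions made for illustration, not part of the commit.

```python
from collections import Counter


def count_tokens(sequences, vocab_size):
    """Return a dense list where index i holds the occurrence count of token id i."""
    counter = Counter()
    for token_ids in sequences:
        counter.update(token_ids)
    counts = [0] * vocab_size
    for token_id, n in counter.items():
        counts[token_id] = n
    return counts


def smoothed_mask_weights(counts, alpha=0.7):
    """Hypothetical smoothing: weight = max(count, 1) ** -alpha.

    Clamping to 1 avoids raising zero to a negative power for unseen ids.
    These are unnormalized weights, not probabilities.
    """
    return [max(c, 1) ** -alpha for c in counts]


# Toy "binarized dataset": two sequences of token ids over a vocabulary of 5.
toy_data = [[0, 1, 1, 2], [1, 2, 2, 2]]
counts = count_tokens(toy_data, vocab_size=5)
weights = smoothed_mask_weights(counts)

print(counts)  # [1, 3, 4, 0, 0]
```

With these toy counts, token id 2 (seen four times) gets a smaller weight than token id 1 (seen three times), which is the intended effect of down-weighting frequent tokens.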