chenpangpang / transformers
Commit 7defc667 (Unverified)
Authored May 14, 2020 by Lysandre Debut
Committed by GitHub on May 14, 2020
p_mask in SQuAD pre-processing (#4049)

* Better p_mask building
* Adressing @mfuntowicz comments

parent 84894974
Showing 1 changed file with 13 additions and 9 deletions

src/transformers/data/processors/squad.py (+13 -9)
@@ -195,18 +195,22 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
...
         cls_index = span["input_ids"].index(tokenizer.cls_token_id)
         # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
-        # Original TF implem also keep the classification token (set to 0) (not sure why...)
-        p_mask = np.array(span["token_type_ids"])
-        p_mask = np.minimum(p_mask, 1)
+        # Original TF implem also keep the classification token (set to 0)
+        p_mask = np.ones_like(span["token_type_ids"])
         if tokenizer.padding_side == "right":
-            # Limit positive values to one
-            p_mask = 1 - p_mask
-        p_mask[np.where(np.array(span["input_ids"]) == tokenizer.sep_token_id)[0]] = 1
-        # Set the CLS index to '0'
+            p_mask[len(truncated_query) + sequence_added_tokens :] = 0
+        else:
+            p_mask[-len(span["tokens"]) : -(len(truncated_query) + sequence_added_tokens)] = 0
+        pad_token_indices = np.where(span["input_ids"] == tokenizer.pad_token_id)
+        special_token_indices = np.asarray(
+            tokenizer.get_special_tokens_mask(span["input_ids"], already_has_special_tokens=True)
+        ).nonzero()
+        p_mask[pad_token_indices] = 1
+        p_mask[special_token_indices] = 1
+        # Set the cls index to 0: the CLS index can be used for impossible answers
         p_mask[cls_index] = 0
         span_is_impossible = example.is_impossible
...
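To make the new mask-building logic easier to follow, here is a minimal standalone sketch of the same steps on toy data. It is not the code from the commit. The function name `build_p_mask`, its parameters, and the token ids (`CLS_ID`, `SEP_ID`, `PAD_ID`) are hypothetical, and `np.isin` stands in for `tokenizer.get_special_tokens_mask`. The sketch converts `input_ids` to a NumPy array before comparing against the pad id, because comparing a plain Python list with an integer yields a single boolean rather than an elementwise mask.

```python
import numpy as np

# Hypothetical token ids, chosen for illustration only.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_p_mask(input_ids, query_len, added_tokens, num_span_tokens, padding_side="right"):
    """Return a mask with 1 where a token cannot be part of the answer, 0 where it can.

    query_len       -- number of query tokens (mirrors len(truncated_query))
    added_tokens    -- special tokens counted with the first segment (mirrors sequence_added_tokens)
    num_span_tokens -- number of non-padding tokens (mirrors len(span["tokens"]))
    """
    ids = np.asarray(input_ids)

    # Start by masking everything, then unmask the context segment.
    p_mask = np.ones_like(ids)
    if padding_side == "right":
        # Layout assumed: [CLS] query [SEP] context [SEP] [PAD]...
        p_mask[query_len + added_tokens:] = 0
    else:
        # Layout assumed: [PAD]... context [SEP] query [SEP] [CLS]
        p_mask[-num_span_tokens:-(query_len + added_tokens)] = 0

    # Re-mask padding and special tokens that fell inside the unmasked slice.
    p_mask[ids == PAD_ID] = 1
    p_mask[np.isin(ids, [CLS_ID, SEP_ID])] = 1  # stand-in for get_special_tokens_mask

    # Keep the CLS position unmasked so it can represent "no answer".
    cls_index = int(np.flatnonzero(ids == CLS_ID)[0])
    p_mask[cls_index] = 0
    return p_mask


right_padded = [101, 7, 8, 102, 20, 21, 22, 102, 0, 0]
print(build_p_mask(right_padded, query_len=2, added_tokens=2, num_span_tokens=8).tolist())
# [0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

In the right-padded example only the three context tokens and the CLS position end up at 0; the query, both separators, and the padding stay masked.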