Unverified commit b08259a1 authored by Funtowicz Morgan, committed by GitHub

run_ner.py / bert-base-multilingual-cased can output empty tokens (#2991)



* Use tokenizer.num_added_tokens to count the number of added special tokens instead of hardcoded numbers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* run_ner.py - Do not add a label to label_ids if word_tokens is empty.

This can happen when using bert-base-multilingual-cased with an input containing a lone space.
In this case, the tokenizer returns an empty word_tokens list, which desynchronizes the outputs:
label_ids ends up with one more entry than the tokens vector.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
parent f4f49468
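For context, here is a minimal repro sketch of the bug described above (an illustration, not part of the commit; it assumes a transformers install that can download bert-base-multilingual-cased):

```python
# Hypothetical repro (not part of the commit): a lone-space "word" tokenizes
# to [], which used to desynchronize tokens and label_ids in run_ner.py.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

words = ["Hello", " ", "world"]  # the middle "word" is a single space
labels = ["O", "O", "O"]
label_map = {"O": 0}
pad_token_label_id = -100

tokens, label_ids = [], []
for word, label in zip(words, labels):
    word_tokens = tokenizer.tokenize(word)  # returns [] for the lone space
    if len(word_tokens) > 0:  # the fix: skip words that tokenize to nothing
        tokens.extend(word_tokens)
        # Real label id for the first sub-token, padding ids for the rest
        label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

# Without the len(word_tokens) > 0 guard, label_ids would hold one more
# entry than tokens; with it, the two stay aligned.
assert len(tokens) == len(label_ids)
```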
@@ -112,12 +112,15 @@ def convert_examples_to_features(
         label_ids = []
         for word, label in zip(example.words, example.labels):
             word_tokens = tokenizer.tokenize(word)
-            tokens.extend(word_tokens)
-            # Use the real label id for the first token of the word, and padding ids for the remaining tokens
-            label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
+
+            # bert-base-multilingual-cased sometimes outputs nothing ([]) when calling tokenize with just a space.
+            if len(word_tokens) > 0:
+                tokens.extend(word_tokens)
+                # Use the real label id for the first token of the word, and padding ids for the remaining tokens
+                label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
 
         # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
-        special_tokens_count = 3 if sep_token_extra else 2
+        special_tokens_count = tokenizer.num_added_tokens()
         if len(tokens) > max_seq_length - special_tokens_count:
             tokens = tokens[: (max_seq_length - special_tokens_count)]
             label_ids = label_ids[: (max_seq_length - special_tokens_count)]
...
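The second hunk replaces the hardcoded special-token counts with a query on the tokenizer itself. A hedged sketch of that idea (illustration only, not the commit's code; num_added_tokens was the method name at the time of this commit, and later transformers releases renamed it num_special_tokens_to_add):

```python
# Illustrative sketch (not the commit's code): derive the special-token
# budget from the tokenizer instead of hardcoding 2 ([CLS] + [SEP]) or 3.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# num_added_tokens() at the time of this commit; renamed to
# num_special_tokens_to_add() in later transformers releases.
if hasattr(tokenizer, "num_special_tokens_to_add"):
    special_tokens_count = tokenizer.num_special_tokens_to_add(pair=False)
else:
    special_tokens_count = tokenizer.num_added_tokens()

print(special_tokens_count)  # 2 for BERT: [CLS] and [SEP]

max_seq_length = 128
tokens = ["token"] * 200  # pretend over-long sequence
if len(tokens) > max_seq_length - special_tokens_count:
    tokens = tokens[: (max_seq_length - special_tokens_count)]
print(len(tokens))  # 126, leaving room for [CLS] and [SEP]
```

Querying the tokenizer keeps the truncation logic correct for any model whose post-processing adds a different number of special tokens, instead of relying on the sep_token_extra flag being set correctly per architecture.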